Abstract

Information diffusion in online social networks is affected by the underlying network topology, but 
it also has the power to change it. Online users are constantly creating new links when exposed to 
new information sources, and in turn these links are altering the way information spreads. However,
these two highly intertwined stochastic processes, information diffusion and network evolution, have been
predominantly studied separately, ignoring their co-evolutionary dynamics.

We propose a temporal point process model, Coevolve, for such joint dynamics, allowing the intensity
of one process to be modulated by that of the other. This model allows us to efficiently simulate
interleaved diffusion and network events, and generate traces obeying common diffusion and network patterns
observed in real-world networks. Furthermore, we also develop a convex optimization framework
to learn the parameters of the model from historical diffusion and network evolution traces. We experimented
with both synthetic data and data gathered from Twitter, and show that our model provides a
good fit to the data as well as more accurate predictions than alternatives.


1 Introduction 

Online social networks, such as Twitter or Weibo, have become large information networks where people 
share, discuss and search for information of personal interest as well as breaking news [1]. In this context, 
users often forward to their followers information they are exposed to via their followees, triggering the
emergence of information cascades that travel through the network [2], and constantly create new links 
to information sources, triggering changes in the network itself over time. Importantly, recent empirical 
studies with Twitter data have shown that both information diffusion and network evolution are coupled 
and network changes are often triggered by information diffusion [3I1IS]. 

While there have been many recent works on modeling information diffusion liiTiiaiaii] and network 
evolution uni [III E], most of them treat these two stochastic processes independently and separately, 
ignoring the influence one may have on the other over time. Thus, to better understand information diffusion
and network evolution, there is an urgent need for joint probabilistic models of the two processes, which are
largely nonexistent to date.

In this paper, we propose a probabilistic generative model, Coevolve, for the joint dynamics of information
diffusion and network evolution. Our model is based on the framework of temporal point processes,
which explicitly characterizes the continuous time interval between events, and it consists of two interwoven
and interdependent components, as shown in Figure 1:

Figure 1: Illustration of how information diffusion and network structure processes interact

I. Information diffusion process. We design an “identity revealing” multivariate Hawkes process [13]
to capture the mutual excitation behavior of retweeting events, where the intensity of such events for a
user is boosted by previous events from her time-varying set of followees. Although Hawkes processes
have been used for information diffusion before [ni[T51[TOl[T7|[T51[T^pni|2T]. the key innovation of 
our approach is to explicitly model the excitation due to a particular source node, hence revealing 
the identity of the source. Such a design reflects the reality that information sources are explicitly
acknowledged, and it also allows a particular information source to acquire new links at a rate that depends
on her “informativeness”.


II. Network evolution process. We model link creation as an “information driven” survival process, 
and couple the intensity of this process with retweeting events. Although survival processes have been 
used for link creation before [22, 23], the key innovation in our model is to incorporate retweeting
events as the driving force for such processes. Since our model has captured the source identity of each 
retweeting event, new links will be targeted toward information sources, with an intensity proportional 
to their degree of excitation and each source’s influence. 

Our model is designed in such a way that it allows the two processes, information diffusion and network
evolution, to unfold simultaneously on the same time scale and exert bidirectional influence on each other,
allowing sophisticated coevolutionary dynamics to be generated, as illustrated in Figure 2.

Importantly, the flexibility of our model does not prevent us from efficiently simulating diffusion and link 
events from the model and learning its parameters from real world data: 

• Efficient simulation. We design a scalable sampling procedure that exploits the sparsity of the
generated networks. Its complexity is $O(nd \log m)$, where n is the number of events, m is the number
of users and d is the maximum number of followees per user.

• Convex parameter learning. We show that the model parameters that maximize the joint likelihood
of observed diffusion and link creation events can be efficiently found via convex optimization.

Then, we experiment with our model and show that it can produce coevolutionary dynamics of information 
diffusion and network evolution, and generate retweet and link events that obey common information diffusion 
patterns (e.g., cascade structure, size and depth), static network patterns (e.g., node degree) and temporal
network patterns (e.g., shrinking diameter) described in related literature [24, 12, 25]. Finally, we show that,
by modeling the coevolutionary dynamics, our model provides significantly more accurate link and diffusion
event predictions than alternatives on a large-scale Twitter dataset [3].

The remainder of this article is organized as follows. We first build sufficient background
on the temporal point processes framework in Section 2. Then, we introduce our joint model of information
diffusion and network structure co-evolution in Section 3. Sections 4 and 5 are devoted to answering two
essential questions: how can we generate data from the model, and how can we efficiently learn the model
parameters from historical event data? Any generative model should be able to answer these questions.
In Sections 6, 7 and 8 we empirically investigate the properties of the model, evaluate the
accuracy of the parameter estimation on synthetic data, and evaluate the performance of the proposed
model on a real-world dataset, respectively. Section 9 reviews the related work and Section 10 discusses some
extensions to the proposed model. Finally, the paper is concluded in Section 11.
Figure 2: Illustration of information diffusion and network structure co-evolution: David’s tweet at 1:00 pm
about a paper is retweeted by Sophie and Christine at 1:10 pm and 1:15 pm, respectively, and thereby reaches
Jacob. Jacob retweets about this paper at 1:20 pm and 1:35 pm, then finds David a good source of
information and decides to follow him directly at 1:45 pm. Therefore, a new path of information to him (and
his downstream followers) is created. As a consequence, a subsequent tweet by David about a car at 2:00
pm reaches Jacob directly, without the need for Sophie and Christine to retweet it.


2 Background on Temporal Point Processes 

A temporal point process is a random process whose realization consists of a list of discrete events localized in
time, $\{t_i\}$ with $t_i \in \mathbb{R}^+$ and $i \in \mathbb{Z}^+$. Many different types of data produced in online social networks can be
represented as temporal point processes, such as the times of retweets and link creations. A temporal point
process can be equivalently represented as a counting process, $N(t)$, which records the number of events
before time t. Let the history $\mathcal{H}(t)$ be the list of times of events $\{t_1, t_2, \ldots, t_n\}$ up to but not including time
t. Then, the number of observed events in a small time window $[t, t + dt)$ of length dt is

$$dN(t) = \sum_{t_i \in \mathcal{H}(t)} \delta(t - t_i)\,dt, \qquad (1)$$

and hence $N(t) = \int_0^t dN(s)$, where $\delta(t)$ is a Dirac delta function. More generally, given a function $f(t)$, we
can define the convolution with respect to $dN(t)$ as

$$f(t) \star dN(t) := \int_0^t f(t - \tau)\,dN(\tau) = \sum_{t_i \in \mathcal{H}(t)} f(t - t_i). \qquad (2)$$

The point process representation of temporal data is fundamentally different from the discrete time representation
typically used in social network analysis. It directly models the time interval between events as
random variables, avoids the need to pick a time window to aggregate events, and allows temporal events to
be modeled in a fine-grained fashion. Moreover, it has a remarkably rich theoretical support [26].

An important way to characterize temporal point processes is via the conditional intensity function,
a stochastic model for the time of the next event given all the times of previous events. Formally, the
conditional intensity function $\lambda^*(t)$ (intensity, for short) is the conditional probability of observing an event
in a small window $[t, t + dt)$ given the history $\mathcal{H}(t)$, i.e.,

$$\lambda^*(t)\,dt := P\{\text{event in } [t, t + dt) \mid \mathcal{H}(t)\} = \mathbb{E}[dN(t) \mid \mathcal{H}(t)], \qquad (3)$$

where one typically assumes that only one event can happen in a small window of size dt and thus $dN(t) \in \{0, 1\}$.

Figure 3: Illustration of the conditional density function, the conditional cumulative density function and
the survival function

Figure 4: Three types of point processes with a typical realization: a) Poisson process, b) Hawkes process, c) Survival process

Then, given the observation until time t and a time $t' \geq t$, we can also characterize the conditional
probability that no event happens until t' as

$$S^*(t') = \exp\left(-\int_t^{t'} \lambda^*(\tau)\,d\tau\right), \qquad (4)$$

the (conditional) probability density function that an event occurs at time t' as

$$f^*(t') = \lambda^*(t')\,S^*(t'), \qquad (5)$$

and the (conditional) cumulative density function, which accounts for the probability that an event happens
before time t':

$$F^*(t') = 1 - S^*(t') = \int_t^{t'} f^*(\tau)\,d\tau. \qquad (6)$$

Figure 3 illustrates these quantities. Moreover, we can express the log-likelihood of a list of events $\{t_1, t_2, \ldots, t_n\}$
in an observation window $[0, T)$ as

$$\mathcal{L} = \sum_{i=1}^{n} \log \lambda^*(t_i) - \int_0^T \lambda^*(\tau)\,d\tau, \qquad T \geq t_n. \qquad (7)$$

This simple log-likelihood will later enable us to learn the parameters of our model from observed data.
Finally, the functional form of the intensity $\lambda^*(t)$ is often designed to capture the phenomena of interest.
Some useful functional forms we will use are [26]:

(i) Poisson process. The intensity is assumed to be independent of the history $\mathcal{H}(t)$, but it can be a
nonnegative time-varying function, i.e.,

$$\lambda^*(t) = g(t) \geq 0. \qquad (8)$$
(ii) Hawkes process. The intensity is history dependent and models a mutual excitation between
events, i.e.,

$$\lambda^*(t) = \mu + \alpha\,\kappa_\omega(t) \star dN(t) = \mu + \alpha \sum_{t_i \in \mathcal{H}(t)} \kappa_\omega(t - t_i), \qquad (9)$$

where

$$\kappa_\omega(t) := \exp(-\omega t)\,\mathbb{I}[t \geq 0] \qquad (10)$$

is an exponential triggering kernel and $\mu \geq 0$ is a baseline intensity independent of the history. Here,
the occurrence of each historical event increases the intensity by a certain amount determined by the
kernel and the weight $\alpha \geq 0$, making the intensity history dependent and a stochastic process by itself.
In our work, we focus on the exponential kernel; however, other functional forms, such as the log-logistic
function, are possible, and the general properties of our model do not depend on this particular choice.

(iii) Survival process. There is only one event for an instantiation of the process, i.e.,

$$\lambda^*(t) = (1 - N(t))\,g(t), \qquad (11)$$

where $g(t) \geq 0$ and the term $(1 - N(t))$ makes sure $\lambda^*(t)$ is 0 if an event already happened before t.

Figure 4 illustrates these processes. The interested reader may refer to [26] for more details on the framework
of temporal point processes.
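As a small illustration of Equations (7), (9) and (10), the following sketch evaluates the intensity and the log-likelihood of a univariate Hawkes process with an exponential kernel. The function names and argument order are assumptions made for this example only, and the closed-form integral relies on the exponential kernel.

```python
import math

def hawkes_intensity(t, events, mu, alpha, omega):
    """Equation (9): mu + alpha * sum of exp(-omega * (t - t_i)) over events strictly before t."""
    return mu + alpha * sum(math.exp(-omega * (t - ti)) for ti in events if ti < t)

def hawkes_log_likelihood(events, T, mu, alpha, omega):
    """Equation (7) for the intensity of Equation (9), observed on [0, T)."""
    log_term = sum(math.log(hawkes_intensity(ti, events, mu, alpha, omega)) for ti in events)
    # Closed-form integral of the intensity over [0, T) for the exponential kernel.
    integral = mu * T + (alpha / omega) * sum(1.0 - math.exp(-omega * (T - ti)) for ti in events)
    return log_term - integral
```

With `alpha = 0` the process reduces to a homogeneous Poisson process, and the log-likelihood becomes $n \log \mu - \mu T$, which gives a quick sanity check.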


3 Generative Model of Information Diffusion and Network Evolution

In this section, we use the above background on temporal point processes to formulate Coevolve, our 
probabilistic model for the joint dynamics of information diffusion and network evolution. 

3.1 Event Representation 

We model the generation of two types of events: tweet/retweet events, $e^r$, and link creation events, $e^l$.
Instead of just the time t, we record each event as a triplet, as illustrated in Figure 5(a):

$$e^r \text{ or } e^l := (\underbrace{u}_{\text{destination}},\ \underbrace{s}_{\text{source}},\ \underbrace{t}_{\text{time}}). \qquad (12)$$

For a retweet event, the triplet means that the destination node u retweets at time t a tweet originally
posted by source node s. Recording the source node s reflects the real-world scenario that information sources
are explicitly acknowledged. Note that the occurrence of event $e^r$ does not mean that u is directly retweeting
from or is connected to s. This event can happen when u is retweeting a message by another node u' where
the original information source s is acknowledged. Node u will pass on the same source acknowledgement to
its followers (e.g., “I agree @a @b @c @s”). Original tweets posted by node u are allowed in this notation.
In this case, the event will simply be $e^r = (u, u, t)$. Given a list of retweet events up to but not including
time t, the history $\mathcal{H}^r_{us}(t)$ of retweets by u due to source s is

$$\mathcal{H}^r_{us}(t) = \{e^r_i = (u_i, s_i, t_i) \mid u_i = u \text{ and } s_i = s\}. \qquad (13)$$

The entire history of retweet events is denoted as

$$\mathcal{H}^r(t) := \bigcup_{u,s \in [m]} \mathcal{H}^r_{us}(t). \qquad (14)$$

For a link creation event, the triplet means that destination node u creates at time t a link to source
node s, i.e., from time t on, node u starts following node s. To ease the exposition, we restrict ourselves
to the case where links cannot be deleted and thus each (directed) link is created only once. However, our
model can be easily augmented to consider multiple link creations and deletions per node pair, as discussed
in Section 10. We denote the link creation history as $\mathcal{H}^l(t)$.

Figure 5: Events as point and counting processes. Panel (a), event representation, shows a trace of events generated by a tweet
from David followed by new links Jacob creates to follow David and Sophie. Panel (b), point and counting processes, shows the associated
points in time and the counting process realization.
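The triplet representation of Equation (12) and the per-pair history of Equation (13) can be sketched as follows. The `Event` type, the `history_us` helper, the integer node ids and the choice of measuring time in minutes after 1:00 pm are all assumptions of this example; the events themselves follow the story of Figure 2.

```python
from typing import List, NamedTuple

class Event(NamedTuple):
    u: int    # destination node (the user who retweets or creates the link)
    s: int    # source node (the original author, or the user being followed)
    t: float  # event time

def history_us(events: List[Event], u: int, s: int, t: float) -> List[Event]:
    """Events by destination u due to source s, up to but not including time t (Equation (13))."""
    return [e for e in events if e.u == u and e.s == s and e.t < t]

# Hypothetical node ids for the users in Figure 2; times are minutes after 1:00 pm.
DAVID, SOPHIE, CHRISTINE, JACOB = 0, 1, 2, 3
retweets = [
    Event(DAVID, DAVID, 0.0),        # original tweet: destination equals source
    Event(SOPHIE, DAVID, 10.0),
    Event(CHRISTINE, DAVID, 15.0),
    Event(JACOB, DAVID, 20.0),
    Event(JACOB, DAVID, 35.0),
]
links = [Event(JACOB, DAVID, 45.0)]  # Jacob starts following David

jacob_due_to_david = history_us(retweets, JACOB, DAVID, 45.0)
```

Storing the source in every retweet event is what makes the counting processes "identity revealing": the same list can be sliced per (destination, source) pair without consulting the network.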

3.2 Joint Model with Two Interwoven Components 

Given m users, we use two sets of counting processes to record the generated events, one for information 
diffusion and another for network evolution. More specifically, 

I. Retweet events are recorded using a matrix $N(t)$ of size $m \times m$ for each fixed time point t. The $(u, s)$-th
entry in the matrix, $N_{us}(t) \in \{0\} \cup \mathbb{Z}^+$, counts the number of retweets of u due to source s up to time
t. These counting processes are “identity revealing”, since they keep track of the source node that
triggers each retweet. The matrix $N(t)$ is typically less sparse than $A(t)$, since $N_{us}(t)$ can be nonzero
even when node u does not directly follow s. We also let $dN(t) := (dN_{us}(t))_{u,s \in [m]}$.

II. Link events are recorded using an adjacency matrix $A(t)$ of size $m \times m$ for each fixed time point t. The
$(u, s)$-th entry in the matrix, $A_{us}(t) \in \{0, 1\}$, indicates whether u is directly following s. Therefore,
$A_{us}(t) = 1$ means the directed link has been created before t. For simplicity of exposition, we do not
allow self-links. The matrix $A(t)$ is typically sparse, but the number of nonzero entries can change
over time. We also define $dA(t) := (dA_{us}(t))_{u,s \in [m]}$.

Then, the interwoven information diffusion and network evolution processes can be characterized using their
respective intensities

$$\mathbb{E}[dN(t) \mid \mathcal{H}^r(t) \cup \mathcal{H}^l(t)] = \Gamma^*(t)\,dt \qquad (15)$$

$$\mathbb{E}[dA(t) \mid \mathcal{H}^r(t) \cup \mathcal{H}^l(t)] = \Lambda^*(t)\,dt, \qquad (16)$$

where

$$\Gamma^*(t) = (\gamma^*_{us}(t))_{u,s \in [m]} \qquad (17)$$

$$\Lambda^*(t) = (\lambda^*_{us}(t))_{u,s \in [m]}. \qquad (18)$$

The sign * means that the intensity matrices will depend on the joint history, $\mathcal{H}^r(t) \cup \mathcal{H}^l(t)$, and hence
their evolution will be coupled. By this coupling, we make (i) the counting processes for link creation
“information driven” and (ii) the evolution of the linking structure change the information diffusion
process. In the next two sections, we will specify the details of these two intensity matrices.
Figure 6: The breakdown of conditional intensity functions for 1) information diffusion process of Jacob
retweeting posts originated from David, $N_{JD}(t)$; 2) information diffusion process of David tweeting on his
own initiative, $N_{DD}(t)$; 3) link creation process of Jacob following David, $A_{JD}(t)$


3.3 Information Diffusion Process 

We model the intensity, $\Gamma^*(t)$, for retweeting events using a multivariate Hawkes process [13]:

$$\gamma^*_{us}(t) = \mathbb{I}[u = s]\,\eta_u + \mathbb{I}[u \neq s]\,\beta_s \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star (A_{uv}(t)\,dN_{vs}(t)), \qquad (19)$$

where $\mathbb{I}[\cdot]$ is the indicator function and $\mathcal{F}_u(t) := \{v \in [m] : A_{uv}(t) = 1\}$ is the current set of followees of u.
The term $\eta_u \geq 0$ is the intensity of original tweets by a user u on his own initiative, becoming the source of
a cascade, and the term $\beta_s \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star (A_{uv}(t)\,dN_{vs}(t))$ models the propagation of peer influence over
the network, where the triggering kernel $\kappa_{\omega_1}(t)$ models the decay of peer influence over time.

Note that the retweeting intensity matrix $\Gamma^*(t)$ is by itself a stochastic process that depends on the time-varying
network topology, the non-zero entries in $A(t)$, whose growth is controlled by the network evolution
process in Section 3.4. Hence the model design captures the influence of the network topology and each
source’s influence, $\beta_s$, on the information diffusion process. More specifically, to compute $\gamma^*_{us}(t)$, one first
finds the current set $\mathcal{F}_u(t)$ of followees of u, and then aggregates the retweets of these followees that are due
to source s. Note that these followees may or may not directly follow source s. Then, the more frequently
node u is exposed to retweets of tweets originated from source s via her followees, the more likely she will
also retweet a tweet originated from source s. Once node u retweets due to source s, the corresponding
$N_{us}(t)$ will be incremented, and this in turn will increase the likelihood of triggering retweets due to source s
among the followers of u. Thus, the source does not simply broadcast the message to nodes directly following
her but her influence propagates through the network even to those nodes that do not directly follow her.
Finally, this information diffusion model allows a node to repeatedly generate events in a cascade, and is
very different from the independent cascade or linear threshold models [27], which allow at most one event
per node per cascade.
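The computation just described can be sketched directly from Equation (19). Everything about the data layout here is an assumption of this example: `link_time[(u, v)]` holds the time at which u started following v, `retweets` is a list of `(v, source, time)` triplets, and the factor $A_{uv}(t_i)$ is read as "the link from u to v was created strictly before the event time $t_i$".

```python
import math

def retweet_intensity(u, s, t, eta, beta, omega1, link_time, retweets):
    """Equation (19) with an exponential kernel.

    link_time[(u, v)]: time at which u started following v (hypothetical layout).
    retweets: list of (v, source, time) triplets.
    """
    if u == s:
        return eta[u]                     # original tweets on u's own initiative
    total = 0.0
    for (v, src, ti) in retweets:
        if src != s or ti >= t:
            continue                      # keep only past events that credit source s
        created = link_time.get((u, v))
        # A_uv(t_i) = 1 only if u was already following v when v's event occurred.
        if created is not None and created < ti:
            total += math.exp(-omega1 * (t - ti))
    return beta[s] * total

# Jacob (3) follows Sophie (1) and Christine (2), who both retweet David (0).
eta = {0: 0.2, 1: 0.1, 2: 0.1, 3: 0.1}
beta = {0: 0.5}
link_time = {(3, 1): -1.0, (3, 2): -1.0}
retweets = [(0, 0, 0.0), (1, 0, 10.0), (2, 0, 15.0)]
gamma_jacob_david = retweet_intensity(3, 0, 20.0, eta, beta, 0.1, link_time, retweets)
```

Note that Jacob's intensity is positive although he does not follow David, which is the sense in which a source's influence reaches nodes beyond her direct followers.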

3.4 Network Evolution Process 

In our model, each user is exposed to information through a time-varying set of neighbors. By doing so,
information diffusion affects network evolution, increasing the practical application of our model to real-world
network datasets. The particular definition of exposure (e.g., a retweet’s neighbor) depends on the
type of historical information that is available. Remarkably, the flexibility of our model allows for different
types of diffusion events, which we can broadly classify into two categories.

In the first category, events correspond to the times when an information cascade hits a person, for
example, through a retweet from one of her neighbors, but she does not explicitly like or forward the
associated post. Here, we model the intensity, $\Lambda^*(t)$, for link creation using a combination of survival and
Hawkes process:

$$\lambda^*_{us}(t) = (1 - A_{us}(t)) \Big( \mu_u + \alpha_u \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_2}(t) \star dN_{vs}(t) \Big), \qquad (20)$$

where the term $1 - A_{us}(t)$ effectively ensures a link is created only once, and after that, the corresponding
intensity is set to zero. The term $\mu_u \geq 0$ denotes a baseline intensity, which models when a node u decides
to follow a source s spontaneously at her own initiative. The term $\alpha_u \kappa_{\omega_2}(t) \star dN_{vs}(t)$ corresponds to the
retweets by node v (a followee of node u) which are originated from source s. The triggering kernel $\kappa_{\omega_2}(t)$
models the decay of interests over time.

In the second category, the person decides to explicitly like or forward the associated post and influencing
events correspond to the times when she does so. In this case, we model the intensity, $\Lambda^*(t)$, for link creation
as:

$$\lambda^*_{us}(t) = (1 - A_{us}(t)) \big( \mu_u + \alpha_u\,\kappa_{\omega_2}(t) \star dN_{us}(t) \big), \qquad (21)$$

where the terms $1 - A_{us}(t)$, $\mu_u \geq 0$, and the decaying kernel $\kappa_{\omega_2}(t)$ play the same role as the corresponding
ones in Equation (20). The term $\alpha_u \kappa_{\omega_2}(t) \star dN_{us}(t)$ corresponds to the retweets of node u due to tweets
originally published by source s. The higher the corresponding retweet intensity, the more likely u will find
information by source s useful and will create a direct link to s.

In both cases, the link creation intensity $\Lambda^*(t)$ is also a stochastic process by itself, which depends on the
retweet events, be it the retweets by the neighbors of node u or the retweets by node u herself, respectively.
Therefore, it captures the influence of retweets on the link creation, and closes the loop of mutual influence
between information diffusion and network topology. Figure 6 illustrates these two interdependent intensities.
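Both variants of the link intensity share the same shape: a survival factor times a baseline plus an exponentially decaying excitation. The sketch below makes that explicit. The signature is an assumption of this example; the caller passes in `exposures`, the times of whichever retweet events drive the link, namely the followees' retweets due to s for Equation (20) or u's own retweets due to s for Equation (21).

```python
import math

def link_intensity(u, s, t, mu, alpha, omega2, link_time, exposures):
    """Equations (20)/(21): (1 - A_us(t)) * (mu_u + alpha_u * sum_i exp(-omega2 * (t - t_i))).

    link_time[(u, s)]: creation time of the link u -> s, if any (hypothetical layout).
    exposures: times of the retweet events that excite this link process.
    """
    created = link_time.get((u, s))
    if created is not None and created < t:
        return 0.0                        # survival factor: the link already exists
    excitation = sum(math.exp(-omega2 * (t - ti)) for ti in exposures if ti < t)
    return mu[u] + alpha[u] * excitation

# Jacob (3) retweeted David (0) at minutes 20 and 35, then followed him at minute 45.
mu, alpha = {3: 0.01}, {3: 0.2}
before = link_intensity(3, 0, 40.0, mu, alpha, 0.1, {}, [20.0, 35.0])
after = link_intensity(3, 0, 50.0, mu, alpha, 0.1, {(3, 0): 45.0}, [20.0, 35.0])
```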

Intuitively, in the latter category, information diffusion events are more prone to trigger new connections
because they involve the target and source nodes in an explicit interaction; however, they are also less
frequent. Therefore, this category is mostly suitable for large event datasets, such as the ones we generate in our synthetic
experiments. In contrast, in the former category, information diffusion events are less likely to inspire new
links but are found in abundance. Therefore, it is more suitable for smaller datasets, such as the ones we use in our
real-world experiments. Consequently, we used the latter in our synthetic experiments and the former in our
real-world experiments. More generally, the choice of exposure event should be made based on the
type and amount of available historical information.

Finally, note that creating a link is more than just adding a path or allowing information sources to take
shortcuts during diffusion. The network evolution makes fundamental changes to the diffusion dynamics
and stationary distribution of the diffusion process in Section 3.3. As shown in [18], given a fixed network
structure A, the expected retweet intensity $\bar{\mu}_s(t)$ at time t due to source s will depend on the network
structure in a nonlinear fashion, i.e.,

$$\bar{\mu}_s(t) := \mathbb{E}[\Gamma^*_{\cdot s}(t)] = \Big( e^{(A - \omega_1 I)t} + \omega_1 (A - \omega_1 I)^{-1} \big( e^{(A - \omega_1 I)t} - I \big) \Big)\,\boldsymbol{\eta}_s, \qquad (22)$$

where $\boldsymbol{\eta}_s \in \mathbb{R}^m$ has a single nonzero entry with value $\eta_s$ and $e^{(A - \omega_1 I)t}$ is the matrix exponential. When
$t \to \infty$, the stationary intensity $\bar{\mu}_s = (I - A/\omega_1)^{-1}\,\boldsymbol{\eta}_s$ is also nonlinearly related to the network structure.
Thus, given two network structures $A(t)$ and $A(t')$ at two points in time, which are different by a few edges,
the effect of these edges on the information diffusion is not just an additive relation. Depending on how these
newly created edges modify the eigen-structure of the sparse matrix $A(t)$, their effect on the information
diffusion dynamics can be very significant.
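The non-additive effect of new edges on the stationary intensity $(I - A/\omega_1)^{-1}\boldsymbol{\eta}_s$ can be seen on a three-node toy example. This is only a sketch under stated assumptions: A is treated as a generic nonnegative weight matrix whose entry `A[i][j]` is the influence of node j on node i, the spectral radius of $A/\omega_1$ is below one, and the linear system is solved by fixed-point iteration instead of a matrix inverse. The helper names are hypothetical.

```python
def stationary_intensity(A, eta, omega, iters=200):
    """Solve x = eta + (A / omega) x by fixed-point iteration, i.e. x = (I - A/omega)^{-1} eta."""
    m = len(eta)
    x = list(eta)
    for _ in range(iters):
        x = [eta[i] + sum(A[i][j] * x[j] for j in range(m)) / omega for i in range(m)]
    return x

def network(edges, m=3):
    """Build an m x m weight matrix from (row, column, weight) triplets."""
    A = [[0.0] * m for _ in range(m)]
    for (i, j, w) in edges:
        A[i][j] = w
    return A

eta = [1.0, 0.0, 0.0]                  # only node 0 tweets on its own initiative
e1, e2 = (1, 0, 0.5), (2, 1, 0.5)      # node 1 listens to node 0; node 2 listens to node 1
base = stationary_intensity(network([]), eta, 1.0)
only_e1 = stationary_intensity(network([e1]), eta, 1.0)
only_e2 = stationary_intensity(network([e2]), eta, 1.0)
both = stationary_intensity(network([e1, e2]), eta, 1.0)

gain_sum = [(a - b) + (c - b) for a, c, b in zip(only_e1, only_e2, base)]
gain_both = [d - b for d, b in zip(both, base)]
```

Edge `e2` alone changes nothing because node 1 has no activity to pass on, yet together with `e1` it gives node 2 a positive intensity, so the joint gain differs from the sum of the individual gains.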
Figure 7: Ogata’s algorithm vs our simulation algorithm in simulating U interdependent point processes
characterized by intensity functions $\lambda_1(t), \ldots, \lambda_U(t)$. Panel (a) illustrates Ogata’s algorithm, which first
takes a sample from the process with intensity equal to the sum of the individual intensities and then assigns it
to the proper dimension proportionally to its contribution to the sum of intensities. Panel (b) illustrates
our proposed algorithm, which first draws a sample from each dimension independently and then takes the
minimum time among them.


4 Efficient Simulation of Coevolutionary Dynamics 

We could simulate samples (link creations, tweets and retweets) from our model by adapting Ogata’s thinning
algorithm [28], originally designed for multidimensional Hawkes processes. However, a naive implementation
of Ogata’s algorithm would scale poorly, i.e., for each sample, we would need to re-evaluate $\Gamma^*(t)$ and $\Lambda^*(t)$.
Thus, to draw n sample events, we would need to perform $O(n^2 m^2)$ operations, where m is the number
of nodes. Figure 7(a) schematically demonstrates the main steps of Ogata’s algorithm. Please refer to
Appendix A for further details.

Here, we design a sampling procedure that is especially well-fitted for the structure of our model. The
algorithm is based on the following key idea: if we consider each intensity function in $\Gamma^*(t)$ and $\Lambda^*(t)$ as a
separate point process and draw a sample from each, the minimum among all these samples is a valid sample
for the multidimensional point process.

As the results of this section are general and can be applied to simulate any multi-dimensional point
process model, we abuse the notation a little bit and represent U (possibly inter-dependent) point processes
by U intensity functions $\lambda^*_1, \ldots, \lambda^*_U$. In the specific case of simulating coevolutionary dynamics we have
$U = m^2 + m(m - 1)$, where the first and second terms are the number of information diffusion and link creation
processes, respectively. Figure 7 illustrates the way in which both algorithms differ. The new algorithm has
the following steps:

1. Initialization: Simulate each dimension separately and find their next sampled event time.

2. Minimization: Take the minimum among all the sampled times and declare it as the next event of the
multidimensional process.

3. Update: Recalculate the intensities of the dimensions that are affected by this approved sample and
re-sample only their next event. Then go to step 2.

Algorithm 1 Simulation Algorithm for Coevolve

Initialization:
    Initialize the priority queue Q
    for all u, s ∈ [m] do
        Sample next link event $e^l_{us}$ from $A_{us}$ (Algorithm 3)
        Q.insert($e^l_{us}$)
        Sample next retweet event $e^r_{us}$ from $N_{us}$ (Algorithm 3)
        Q.insert($e^r_{us}$)
    end for

General Subroutine:
    t ← 0
    while t < T do
        e ← Q.extract_min()
        if e = (u, s, t') is a retweet event then
            Update the history $\mathcal{H}^r_{us}(t') = \mathcal{H}^r_{us}(t) \cup \{e\}$
            for all v such that v follows u do
                Update event intensity: $\gamma^*_{vs}(t') \leftarrow \gamma^*_{vs}(t) + \beta$
                Sample retweet event $e^r_{vs}$ from $\gamma^*_{vs}$ (Algorithm 3)
                Q.update_key($e^r_{vs}$)
                if NOT (v follows s) then
                    Update link intensity: $\lambda^*_{vs}(t') \leftarrow \lambda^*_{vs}(t) + \alpha$
                    Sample link event $e^l_{vs}$ from $\lambda^*_{vs}$ (Algorithm 3)
                    Q.update_key($e^l_{vs}$)
                end if
            end for
        else
            Update the history $\mathcal{H}^l_{us}(t') = \mathcal{H}^l_{us}(t) \cup \{e\}$
            $\lambda^*_{us}(t) \leftarrow 0$ for all t > t'
        end if
        t ← t'
    end while

Algorithm 2 Efficient Intensity Computation

Global Variables:
    Last time of intensity computation: t
    Last value of intensity computation: I
Initialization:
    t ← 0
    I ← μ

function get_intensity(t')
    I' ← (I − μ) exp(−ω(t' − t)) + μ
    t ← t'
    I ← I'
    return I
end function

Algorithm 3 1-D next event sampling

Input: Current time: t
Output: Next event time: s
    s ← t
    $\hat{\lambda}$ ← $\lambda^*(s)$ (Algorithm 2)
    while s < T do
        g ∼ Exponential($\hat{\lambda}$)
        s ← s + g
        $\bar{\lambda}$ ← $\lambda^*(s)$ (Algorithm 2)
        Rejection test:
        d ∼ Uniform(0, 1)
        if d × $\hat{\lambda}$ < $\bar{\lambda}$ then
            return s
        else
            $\hat{\lambda}$ ← $\bar{\lambda}$
        end if
    end while
    return s

To prove that the new algorithm generates samples from the same distribution as Ogata’s algorithm does,
we need the following Lemma. It justifies step 2 of the above outline.

Lemma 1 Assume we have U independent non-homogeneous Poisson processes with intensity $\lambda^*_1(\tau), \ldots, \lambda^*_U(\tau)$.
Take random variable $\tau_u$ equal to the time of process u’s first event after time t. Define $\tau_{min} = \min_{1 \leq u \leq U} \{\tau_u\}$
and $u_{min} = \arg\min_{1 \leq u \leq U} \{\tau_u\}$. Then,

(a) $\tau_{min}$ is the first event after time t of the Poisson process with intensity $\lambda^*_{sum}(\tau) := \sum_{u=1}^{U} \lambda^*_u(\tau)$. In other words,
$\tau_{min}$ has the same distribution as the next event (t') in Ogata’s algorithm.

(b) $u_{min}$ follows the conditional distribution $P(u_{min} = u \mid \tau_{min} = x) = \lambda^*_u(x) / \lambda^*_{sum}(x)$, i.e., the dimension firing
the event comes from the same distribution as the one in Ogata’s algorithm.

Proof (a) The waiting time of the first event of a dimension u is an exponentially distributed random variable,
i.e., $\tau_u - t \sim \text{Exponential}\left(\int_t^{t+x} \lambda^*_u(\tau)\,d\tau\right)$. (Recall that if a random variable X is exponentially distributed
with parameter r, then $f_X(x) = r \exp(-rx)$ is its probability density function and $F_X(x) = 1 - \exp(-rx)$ is its
cumulative distribution function.) We have:

$$\begin{aligned}
P(\tau_{min} \leq x \mid x > t) &= 1 - P(\tau_{min} > x \mid x > t) = 1 - P(\min(\tau_1, \ldots, \tau_U) > x \mid x > t) \\
&= 1 - P(\tau_1 > x, \ldots, \tau_U > x \mid x > t) = 1 - \prod_{u=1}^{U} P(\tau_u > x \mid x > t) \\
&= 1 - \prod_{u=1}^{U} \exp\left(-\int_t^{t+x} \lambda^*_u(\tau)\,d\tau\right) = 1 - \exp\left(-\int_t^{t+x} \lambda^*_{sum}(\tau)\,d\tau\right).
\end{aligned} \qquad (23)$$

Therefore, $\tau_{min} - t$ is exponentially distributed with parameter $\int_t^{t+x} \lambda^*_{sum}(\tau)\,d\tau$, which can be seen as the
first event of a non-homogeneous Poisson process with intensity $\lambda^*_{sum}(\tau)$ after time t.

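(b) To find the distribution of $u_{min}$, note that dimension u fires first at x when its own first event occurs at x
and every other dimension has survived until then. Writing $\lambda^*_{sum} = \sum_{u} \lambda^*_u$, we have

$$P(u_{min} = u, \tau_{min} = x) = \lambda^*_u(x) \exp\left(-\int_t^{t+x} \lambda^*_u(\tau)\,d\tau\right) \prod_{v \neq u} \exp\left(-\int_t^{t+x} \lambda^*_v(\tau)\,d\tau\right) = \lambda^*_u(x) \exp\left(-\int_t^{t+x} \lambda^*_{sum}(\tau)\,d\tau\right). \qquad (24)$$

After normalization we get $P(u_{min} = u \mid \tau_{min} = x) = \lambda^*_u(x) / \lambda^*_{sum}(x)$.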


Given the above Lemma, we can now prove that the distribution of the samples generated by the proposed 
algorithm is identical to the one generated by Ogata’s method. 

Theorem 2 The sequence of samples from Ogata’s algorithm and our proposed algorithm follow the same
distribution.

Proof Using the chain rule, the probability of observing $\mathcal{H}_T = \{(t_1, u_1), \ldots, (t_n, u_n)\}$ is written as:

$$P\{(t_1, u_1), \ldots, (t_n, u_n)\} = \prod_{i=1}^{n} P\{(t_i, u_i) \mid (t_{i-1}, u_{i-1}), \ldots, (t_1, u_1)\} = \prod_{i=1}^{n} P\{(t_i, u_i) \mid \mathcal{H}_{t_i}\}. \qquad (25)$$

By fixing the history up to some time, say $t_i$, all dimensions of the multivariate Hawkes process become independent
of each other (until the next event happens). Therefore, the above lemma can be applied to show that
the next sample time from Ogata’s algorithm and the proposed one come from the same distribution, i.e.,
for every i, $P\{(t_i, u_i) \mid \mathcal{H}_{t_i}\}$ is the same for both algorithms. Thus, the multiplication of individual terms is
also equal for both. This will prove the theorem. ■
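Lemma 1 can be checked numerically in its simplest special case. The sketch below assumes constant (homogeneous) rates, which is a simplification of the non-homogeneous setting of the lemma, and uses a fixed seed and a hypothetical helper `first_reaction`. It draws one waiting time per dimension, keeps the minimum, and compares the empirical mean of the minimum and the frequency of each winning dimension with $1/\lambda_{sum}$ and $\lambda_u/\lambda_{sum}$.

```python
import random

def first_reaction(rates, rng):
    """Draw one exponential waiting time per dimension; return (minimum time, its index)."""
    times = [rng.expovariate(r) for r in rates]
    u_min = min(range(len(rates)), key=times.__getitem__)
    return times[u_min], u_min

rng = random.Random(0)
rates = [1.0, 2.0, 3.0]
total = sum(rates)
n = 20000
samples = [first_reaction(rates, rng) for _ in range(n)]

# Part (a): the minimum behaves like the first event of a process with the summed rate.
mean_min = sum(t for t, _ in samples) / n
# Part (b): dimension u wins with probability rate_u / total.
freq = [sum(1 for _, u in samples if u == k) / n for k in range(len(rates))]
```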


This new algorithm is especially suitable for the structure of our inter-coupled processes. Since social
and information networks are typically sparse, every time we sample a new node (or link) event from the
model, only a small number of intensity functions in the local neighborhood of the node (or the link) will
change. This number is $O(d)$, where d is the maximum number of followers/followees per node. As a
consequence, we can reuse most of the individual samples for the next overall sample. Moreover, we can find
which intensity function has the minimum sample time in $O(\log m)$ operations using a heap priority queue.
The heap data structure will help maintain the minimum and find it in logarithmic time with respect to
the number of elements therein. Therefore, we have reduced an $O(nm)$ factor in the original algorithm to
$O(d \log m)$.

Finally, we exploit the properties of the exponential function to update individual intensities for each new
sample in $O(1)$. For simplicity, consider a Hawkes process with intensity $\lambda^*(t) = \mu + \sum_{t_i \in \mathcal{H}(t)} \alpha \exp(-\omega(t - t_i))$.
Note that both link creation and information diffusion processes have this structure. Now, let $t_i < t_{i+1}$
be two arbitrary times with no event in $[t_i, t_{i+1})$; we have

$$\lambda^*(t_{i+1}) = (\lambda^*(t_i) - \mu) \exp(-\omega(t_{i+1} - t_i)) + \mu. \qquad (26)$$

It can be readily generalized to the multivariate case too. Therefore, we can compute the current intensity
without explicitly iterating over all previous events. As a result, we can change an $O(n)$ factor in the original
algorithm to $O(1)$. Furthermore, the exponential kernel also facilitates finding the upper bound of the
intensity, since it always lies at the beginning of one of the processes taken into consideration. Algorithm 2
summarizes the procedure to compute intensities with exponential kernels, and Algorithm 3 shows the
procedure to sample the next event in each dimension, making use of the special property of exponential
kernel functions.
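The constant-time update of Equation (26) can be sketched and compared against the direct sum over all past events. The class name `ExpIntensity` and its methods are assumptions of this example; queries must be made in nondecreasing time order, and the value stored right after `add_event` is the right limit of the intensity at the event time.

```python
import math

class ExpIntensity:
    """Tracks lambda*(t) = mu + alpha * sum_i exp(-omega * (t - t_i)) with O(1) work per call."""

    def __init__(self, mu, alpha, omega):
        self.mu, self.alpha, self.omega = mu, alpha, omega
        self.t, self.value = 0.0, mu

    def get(self, t_new):
        # Equation (26): decay the excess over the baseline since the last update.
        self.value = (self.value - self.mu) * math.exp(-self.omega * (t_new - self.t)) + self.mu
        self.t = t_new
        return self.value

    def add_event(self, t_event):
        # Decay up to the event time, then add the jump contributed by the new event.
        self.get(t_event)
        self.value += self.alpha

def direct_intensity(t, events, mu, alpha, omega):
    """Reference computation that iterates over every past event."""
    return mu + alpha * sum(math.exp(-omega * (t - ti)) for ti in events if ti < t)

tracker = ExpIntensity(mu=0.5, alpha=0.8, omega=1.5)
events = [1.0, 1.7, 3.2]
for ti in events:
    tracker.add_event(ti)
recursive_value = tracker.get(4.0)
direct_value = direct_intensity(4.0, events, 0.5, 0.8, 1.5)
```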

The simulation algorithm is shown in Algorithm 1. By using this algorithm we reduce the complexity
from $O(n^2 m^2)$ to $O(nd \log m)$, where d is the maximum number of followees per node. That means our
algorithm scales logarithmically with the number of nodes and linearly with the number of edges at any
point in time during the simulation. Moreover, events for new links, tweets and retweets are generated in a
temporally intertwined and interleaving fashion, since every new retweet event will modify the intensity for
link creation and vice versa.
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5 Efficient Parameter Estimation from Coevolutionary Events 

In this section, we first show that learning the parameters of our proposed model reduces to solving a convex 
optimization problem and then develop an efficient, parameter-free Minorization-Maximization algorithm to 
solve such problem. 


5.1 Concave Parameter Learning Problem 

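Given a collection of retweet events $\mathcal{E} = \{e_i^r\}$ and link creation events $\mathcal{A} = \{e_i^l\}$
recorded within a time window $[0, T)$, we can easily estimate the parameters needed in our model using
maximum likelihood estimation. To this aim, we compute the joint log-likelihood $\mathcal{L}$ of these events, i.e.,

$$
\mathcal{L}(\{\mu_u\}, \{\alpha_u\}, \{\eta_u\}, \{\beta_s\}) =
\underbrace{\sum_{e_i^r \in \mathcal{E}} \log\big(\gamma^*_{u_i s_i}(t_i)\big) - \sum_{u,s \in [m]} \int_0^T \gamma^*_{us}(\tau)\, d\tau}_{\text{tweet / retweet}}
+ \underbrace{\sum_{e_i^l \in \mathcal{A}} \log\big(\lambda^*_{u_i s_i}(t_i)\big) - \sum_{u,s \in [m]} \int_0^T \lambda^*_{us}(\tau)\, d\tau}_{\text{links}}. \tag{27}
$$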

For the terms corresponding to retweets, the log term sums only over the actual observed events, while
the integral term sums over all possible combinations of destination and source pairs, even if there
is no event between a particular pair of destination and source. For such pairs with no observed events,
the corresponding counting processes have essentially survived the observation window $[0, T)$, and the term
$-\int_0^T \gamma^*_{us}(\tau)\, d\tau$ simply corresponds to the log survival probability. The terms
corresponding to links have a similar structure.
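The two kinds of terms can be illustrated on a single dimension. The sketch below assumes a univariate Hawkes process with intensity mu + alpha * sum_j exp(-omega (t - t_j)); the function name and parameter values are hypothetical and the full model sums such terms over all pairs. With no observed events, the result reduces to the log survival probability, i.e., minus the integrated intensity.

```python
import math

def hawkes_loglik(events, T, mu, alpha, omega):
    """sum_i log lambda(t_i) - int_0^T lambda(t) dt for a sorted event list."""
    log_term, excitation, last = 0.0, 0.0, 0.0
    for t in events:
        excitation *= math.exp(-omega * (t - last))   # decay since previous event
        log_term += math.log(mu + alpha * excitation) # intensity just before t
        excitation += 1.0                             # this event starts exciting
        last = t
    # Closed-form integral of the exponential kernel over [t_j, T).
    integral = mu * T + sum(
        alpha / omega * (1.0 - math.exp(-omega * (T - t))) for t in events
    )
    return log_term - integral

no_events = hawkes_loglik([], T=10.0, mu=0.3, alpha=0.5, omega=1.0)
one_event = hawkes_loglik([1.0], T=2.0, mu=0.5, alpha=0.8, omega=2.0)
```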

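Once we have an expression for the joint log-likelihood of the retweet and link creation events, the
parameter learning problem can then be formulated as follows:

$$
\begin{aligned}
\underset{\{\mu_u\}, \{\alpha_u\}, \{\eta_u\}, \{\beta_s\}}{\text{minimize}} \quad & -\mathcal{L}(\{\mu_u\}, \{\alpha_u\}, \{\eta_u\}, \{\beta_s\}) \\
\text{subject to} \quad & \mu_u \ge 0, \ \alpha_u \ge 0, \ \eta_u \ge 0, \ \beta_s \ge 0 \quad \forall\, u, s \in [m].
\end{aligned} \tag{28}
$$

Theorem 3 The optimization problem defined by Equation (28) is jointly convex.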


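Proof We expand the likelihood by replacing the intensity functions into Equation (27):

$$
\begin{aligned}
\mathcal{L} ={}& \sum_{e_i^r \in \mathcal{E}} \log\Big( \mathbb{I}[u_i = s_i]\, \eta_{u_i} + \mathbb{I}[u_i \neq s_i]\, \beta_{s_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \kappa_{\omega_1}(t) \star \big(A_{u_i v}(t)\, dN_{v s_i}(t)\big) \Big|_{t=t_i} \Big) \\
&- \sum_{u,s \in [m]} \Big( \mathbb{I}[u = s]\, \eta_u \int_0^T dt + \mathbb{I}[u \neq s]\, \beta_s \int_0^T \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star \big(A_{uv}(t)\, dN_{vs}(t)\big)\, dt \Big) \\
&+ \sum_{e_i^l \in \mathcal{A}} \log\Big( \mu_{u_i} + \alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \kappa_{\omega_2}(t) \star dN_{v s_i}(t) \Big|_{t=t_i} \Big) \\
&- \sum_{u,s \in [m]} \Big( \mu_u \int_0^T \big(1 - A_{us}(t)\big)\, dt + \alpha_u \int_0^T \big(1 - A_{us}(t)\big) \Big( \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_2}(t) \star dN_{vs}(t) \Big)\, dt \Big).
\end{aligned} \tag{29}
$$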

If we stack all parameters in a vector $x = (\{\mu_u\}, \{\alpha_u\}, \{\eta_u\}, \{\beta_s\})$, one can easily notice
that the log-likelihood $\mathcal{L}$ can be written as $\sum_j \log(a_j^\top x) - \sum_k b_k^\top x$, which is
clearly a concave function with respect to $x$ [30], and thus $-\mathcal{L}$ is convex. Moreover, the constraints
are linear inequalities and thus the domain is a convex set. This completes the proof for convexity of the
optimization problem. ■
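The concavity claim can be verified with a short Hessian computation. The derivation below is a sketch under the assumption that every log argument $a_j^\top x$ is strictly positive on the feasible set; the vectors $a_j$, $b_k$ and the direction $z$ are generic symbols used only for this illustration.

```latex
\begin{align*}
f(x) &= \sum_j \log(a_j^\top x) - \sum_k b_k^\top x, \\
\nabla f(x) &= \sum_j \frac{a_j}{a_j^\top x} - \sum_k b_k, \\
\nabla^2 f(x) &= -\sum_j \frac{a_j a_j^\top}{(a_j^\top x)^2}, \\
z^\top \nabla^2 f(x)\, z &= -\sum_j \frac{(a_j^\top z)^2}{(a_j^\top x)^2} \;\le\; 0 \quad \text{for all } z.
\end{align*}
```

The Hessian is negative semidefinite wherever $f$ is defined, so $f$ is concave and $-f$ is convex; the linear term does not affect the Hessian.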
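Algorithm 4 MM-type parameter learning for Coevolve

Input: Set of retweet events $\mathcal{E} = \{e_i^r\}$ and link creation events $\mathcal{A} = \{e_i^l\}$ observed in time window $[0, T)$
Output: Learned parameters $\{\mu_u\}$, $\{\alpha_u\}$, $\{\eta_u\}$, $\{\beta_s\}$

Initialization:
for $u \leftarrow 1$ to $m$ do
    Initialize $\mu_u$ and $\alpha_u$ randomly
end for
for $u \leftarrow 1$ to $m$ do
    $\eta_u = \frac{\sum_{e_i^r \in \mathcal{E}} \mathbb{I}[u = u_i = s_i]}{T}$
end for
for $s \leftarrow 1$ to $m$ do
    $\beta_s = \frac{\sum_{e_i^r \in \mathcal{E}} \mathbb{I}[s = s_i \neq u_i]}{\sum_{u \in [m]} \mathbb{I}[u \neq s] \int_0^T \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star (A_{uv}(t)\, dN_{vs}(t))\, dt}$
end for
while not converged do
    for $i \leftarrow 1$ to $n_l$ do
        $\nu_{i1} = \frac{\mu_{u_i}}{\mu_{u_i} + \alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} (\kappa_{\omega_2}(t) \star dN_{v s_i}(t))|_{t=t_i}}$
        $\nu_{i2} = \frac{\alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} (\kappa_{\omega_2}(t) \star dN_{v s_i}(t))|_{t=t_i}}{\mu_{u_i} + \alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} (\kappa_{\omega_2}(t) \star dN_{v s_i}(t))|_{t=t_i}}$
    end for
    for $u \leftarrow 1$ to $m$ do
        $\mu_u = \frac{\sum_{e_i^l \in \mathcal{A}} \mathbb{I}[u = u_i]\, \nu_{i1}}{\sum_{s \in [m]} \int_0^T (1 - A_{us}(t))\, dt}$
        $\alpha_u = \frac{\sum_{e_i^l \in \mathcal{A}} \mathbb{I}[u = u_i]\, \nu_{i2}}{\sum_{s \in [m]} \int_0^T (1 - A_{us}(t)) \left( \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_2}(t) \star dN_{vs}(t) \right) dt}$
    end for
end while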


It is notable that the optimization problem decomposes into $m$ independent problems, one per node $u$, and
can be readily parallelized.


5.2 Efficient Minorization-Maximization Algorithm 

Since the optimization problem is jointly convex with respect to all the parameters, one can simply use
any convex optimization method to learn the parameters. However, these methods usually require
hyperparameters such as a step size or an initialization, which may significantly influence convergence.
Instead, the structure of our problem allows us to develop an efficient algorithm inspired by previous work
[min], which leverages Minorization-Maximization (MM) [31] and is parameter-free and insensitive to
initialization.

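Our algorithm utilizes Jensen’s inequality to provide a lower bound for the second log-sum term in the
log-likelihood given by Equation (27). More specifically, consider a set of arbitrary auxiliary variables
$\nu_{ij}$, where $1 \le i \le n_l$, $j = 1, 2$, and $n_l$ is the number of link events, i.e., $n_l = |\mathcal{A}|$.
Further, assume these variables satisfy

$$\forall\, 1 \le i \le n_l: \quad \nu_{i1}, \nu_{i2} \ge 0, \quad \nu_{i1} + \nu_{i2} = 1. \tag{30}$$

Then, we can lower bound the logarithm in Equation (29) using Jensen’s inequality as follows:

$$
\begin{aligned}
& \log\Big( \mu_{u_i} + \alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i} \Big) \\
&\quad = \log\Big( \nu_{i1} \frac{\mu_{u_i}}{\nu_{i1}} + \nu_{i2} \frac{\alpha_{u_i}}{\nu_{i2}} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i} \Big) \\
&\quad \ge \nu_{i1} \log\Big( \frac{\mu_{u_i}}{\nu_{i1}} \Big) + \nu_{i2} \log\Big( \frac{\alpha_{u_i}}{\nu_{i2}} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i} \Big) \\
&\quad = \nu_{i1} \log(\mu_{u_i}) + \nu_{i2} \log(\alpha_{u_i}) + \nu_{i2} \log\Big( \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i} \Big) \\
&\qquad - \nu_{i1} \log(\nu_{i1}) - \nu_{i2} \log(\nu_{i2}).
\end{aligned} \tag{31}
$$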

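Now, we can lower bound the log-likelihood $\mathcal{L}$ as:

$$
\begin{aligned}
\mathcal{L} \ge \mathcal{L}' ={}& \sum_{e_i^r \in \mathcal{E}} \mathbb{I}[u_i = s_i] \log(\eta_{u_i}) + \sum_{e_i^r \in \mathcal{E}} \mathbb{I}[u_i \neq s_i] \log(\beta_{s_i}) \\
&+ \sum_{e_i^r \in \mathcal{E}} \mathbb{I}[u_i \neq s_i] \log\Big( \sum_{v \in \mathcal{F}_{u_i}(t_i)} \kappa_{\omega_1}(t) \star \big(A_{u_i v}(t)\, dN_{v s_i}(t)\big) \Big|_{t=t_i} \Big) \\
&- \sum_{u,s \in [m]} \Big( \mathbb{I}[u = s]\, \eta_u T + \mathbb{I}[u \neq s]\, \beta_s \int_0^T \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star \big(A_{uv}(t)\, dN_{vs}(t)\big)\, dt \Big) \\
&+ \sum_{e_i^l \in \mathcal{A}} \Big( \nu_{i1} \log(\mu_{u_i}) + \nu_{i2} \log(\alpha_{u_i}) + \nu_{i2} \log\Big( \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i} \Big) \Big) \\
&- \sum_{e_i^l \in \mathcal{A}} \Big( \nu_{i1} \log(\nu_{i1}) + \nu_{i2} \log(\nu_{i2}) \Big) \\
&- \sum_{u,s \in [m]} \Big( \mu_u \int_0^T \big(1 - A_{us}(t)\big)\, dt + \alpha_u \int_0^T \big(1 - A_{us}(t)\big) \Big( \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_2}(t) \star dN_{vs}(t) \Big)\, dt \Big).
\end{aligned} \tag{32}
$$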

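By taking the gradient of the lower bound with respect to the parameters, we can find the closed-form
updates that optimize the lower bound:

$$\eta_u = \frac{\sum_{e_i^r \in \mathcal{E}} \mathbb{I}[u = u_i = s_i]}{T} \tag{33}$$

$$\beta_s = \frac{\sum_{e_i^r \in \mathcal{E}} \mathbb{I}[s = s_i \neq u_i]}{\sum_{u \in [m]} \mathbb{I}[u \neq s] \int_0^T \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star \big(A_{uv}(t)\, dN_{vs}(t)\big)\, dt} \tag{34}$$

$$\mu_u = \frac{\sum_{e_i^l \in \mathcal{A}} \mathbb{I}[u = u_i]\, \nu_{i1}}{\sum_{s \in [m]} \int_0^T \big(1 - A_{us}(t)\big)\, dt} \tag{35}$$

$$\alpha_u = \frac{\sum_{e_i^l \in \mathcal{A}} \mathbb{I}[u = u_i]\, \nu_{i2}}{\sum_{s \in [m]} \int_0^T \big(1 - A_{us}(t)\big) \Big( \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_2}(t) \star dN_{vs}(t) \Big)\, dt} \tag{36}$$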

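Finally, although the lower bound is valid for every choice of $\nu_{ij}$ satisfying Equation (30), by maximizing
the lower bound with respect to the auxiliary variables we can make sure that the lower bound is tight:

$$
\begin{aligned}
\underset{\{\nu_{ij}\}}{\text{maximize}} \quad & \mathcal{L}'(\{\mu_u\}, \{\alpha_u\}, \{\eta_u\}, \{\beta_s\}, \{\nu_{ij}\}) \\
\text{subject to} \quad & \nu_{i1} + \nu_{i2} = 1 \quad \forall\, i: 1 \le i \le n_l, \\
& \nu_{ij} \ge 0 \quad \forall\, i: 1 \le i \le n_l, \ j = 1, 2.
\end{aligned} \tag{37}
$$

Fortunately, the above constrained optimization problem can be solved easily via Lagrange multipliers, which
leads to closed-form updates:

$$\nu_{i1} = \frac{\mu_{u_i}}{\mu_{u_i} + \alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i}} \tag{38}$$

$$\nu_{i2} = \frac{\alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i}}{\mu_{u_i} + \alpha_{u_i} \sum_{v \in \mathcal{F}_{u_i}(t_i)} \big(\kappa_{\omega_2}(t) \star dN_{v s_i}(t)\big)\big|_{t=t_i}} \tag{39}$$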

Algorithm 4 summarizes the learning procedure. It is guaranteed to converge to a global optimum [371 [TB]
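The alternating updates can be illustrated on a simplified, single-node version of the link term. The sketch below is an assumption-laden toy, not the paper's implementation: `g[i]` stands for the precomputed kernel sum at the i-th link event, and `B` and `C` stand for the two precomputed integrals that appear in the denominators of the updates for the exogenous and endogenous parameters. The objective is sum_i log(mu + alpha * g[i]) - mu * B - alpha * C.

```python
import math

def objective(g, B, C, mu, alpha):
    """Simplified link log-likelihood for one node (hypothetical toy problem)."""
    return sum(math.log(mu + alpha * gi) for gi in g) - mu * B - alpha * C

def mm_fit(g, B, C, mu=1.0, alpha=1.0, iters=200):
    """Alternate the auxiliary-variable step and the closed-form parameter step."""
    for _ in range(iters):
        # Tighten the bound: responsibility of the exogenous part for each event.
        nu1 = [mu / (mu + alpha * gi) for gi in g]
        s1 = sum(nu1)
        s2 = len(g) - s1            # responsibilities sum to one per event
        # Maximize the bound in closed form.
        mu, alpha = s1 / B, s2 / C
    return mu, alpha

g = [0.5, 2.0, 1.0, 3.0, 0.2]       # illustrative kernel sums at link events
B, C = 4.0, 6.0                     # illustrative integral terms
mu_hat, alpha_hat = mm_fit(g, B, C)
```

No step size appears anywhere, and each iteration cannot decrease the objective, which is the practical appeal of the MM scheme over generic gradient methods.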


6 Properties of Simulated Co-evolution, Networks and Cascades 

In this section, we perform an empirical investigation of the properties of the networks and information 
cascades generated by our model. In particular, we show that our model can generate co-evolutionary 
retweet and link dynamics and a wide spectrum of static and temporal network patterns and information 
cascades. 

6.1 Simulation Settings 

Throughout this section, unless stated otherwise, we simulate the evolution of an 8,000-node network as well
as the propagation of information over the network by sampling from our model using our simulation
algorithm. We set the exogenous intensities of the link and diffusion events to $\mu_u = \mu$ = 4 x 10“^ and
$\eta_u = \eta = 1.5$, respectively, and the triggering kernel parameters to $\omega_1 = \omega_2 = 1$. The
parameter $\mu$ determines the independent growth of the network: roughly speaking, the expected number of
links each user establishes spontaneously before time $T$ is $\mu T$. Whenever we investigate a static property,
we choose the same sparsity level of 0.001.

6.2 Retweet and Link Coevolution 

Figures 8(a,b) visualize the retweet and link events, aggregated across different sources, and the corresponding
intensities for one node and one realization, picked at random. Here, it is already apparent that retweets
and link creations are clustered in time and often follow each other. Further, Figure 8(c) shows the
cross-covariance of the retweet and link creation intensity, computed across multiple realizations, for the
same node, i.e., if $f(t)$ and $g(t)$ are two intensities, the cross-covariance is a function
$h(\tau) = \int f(t + \tau) g(t)\, dt$. It can be seen that the cross-covariance has its peak around 0, i.e.,
retweets and link creations are highly correlated and co-evolve over time. For ease of exposition, we
illustrated co-evolution using one node; however, we found consistent results across nodes.
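A discretized version of the cross-covariance function defined above can be sketched as follows. The sampling grid, the truncation at the boundaries, and the toy signals are assumptions made for illustration; the intensities are taken as already sampled on a regular grid with step `dt`.

```python
def cross_covariance(f, g, max_lag, dt=1.0):
    """Discretized h(tau) = int f(t + tau) g(t) dt for integer lags in [-max_lag, max_lag]."""
    n = len(f)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t in range(n):
            if 0 <= t + lag < n:        # ignore samples outside the window
                total += f[t + lag] * g[t]
        result[lag] = total * dt
    return result

# Toy signals: f is g delayed by one step, so the peak should sit at lag 1.
g = [0, 1, 3, 1, 0, 0, 0]
f = [0, 0, 1, 3, 1, 0, 0]
h = cross_covariance(f, g, max_lag=2)
peak_lag = max(h, key=h.get)
```

A peak near lag zero, as reported in the text, indicates that the two intensities rise and fall together.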

6.3 Degree Distribution 

Empirical studies have shown that the degree distribution of online social networks and microblogging sites
follows a power law uniii], and argued that it is a consequence of the rich-get-richer phenomenon. The degree
distribution of a network is a power law if the expected number of nodes $m_d$ with degree $d$ is given by
$m_d \propto d^{-\gamma}$, where $\gamma > 0$. Intuitively, the higher the values of the parameters $\alpha$ and
$\beta$, the closer the resulting degree distribution follows a power law. This is because the network grows
more locally. Interestingly, the lower their values, the closer the distribution is to that of an Erdos-Renyi
random graph [32], because the edges are added almost uniformly and independently without influence from
the local structure. Figure 9 confirms this intuition by showing the degree distribution for different values
of $\beta$ and $\alpha$.

Figure 8: Coevolutionary dynamics for synthetic data. a) Spike trains of link and retweet events. b) Link
and retweet intensities. c) Cross covariance of link and retweet intensities.

Figure 9: Degree distributions when network sparsity level reaches 0.001 for different $\beta$ ($\alpha$) values
and fixed $\alpha = 0.1$ ($\beta = 0.1$).


6.4 Small (shrinking) Diameter 

There is empirical evidence that online social networks and microblogging sites exhibit a relatively small
diameter, which shrinks (or flattens) as the network grows [33] [TOl [24]. Figures 10(a-b) show
the diameter of the largest connected component (LCC) against the sparsity of the network over time for
different values of $\alpha$ and $\beta$. Although at the beginning there is a short increase in the diameter
due to the merging of small connected components, the diameter decreases as the network evolves. Moreover,
larger values of $\alpha$ or $\beta$ lead to higher levels of local growth in the network and, as a consequence,
slower shrinkage. Here, nodes arrive to the network when they follow (or are followed by) a node in the
largest connected component.

6.5 Clustering Coefficient 

Triadic closure lanniss] has often been presented as a plausible link creation mechanism. However,
different social networks and microblogging sites present different levels of triadic closure [36]. Importantly,
our method is able to generate networks with different levels of triadic closure, as shown by Figure 10(c-d),
where we plot the clustering coefficient, which is proportional to the frequency of triadic closure, for
different values of $\alpha$ and $\beta$.

Figure 10: Diameter and clustering coefficient for network sparsity 0.001. Panels (a) and (b) show the
diameter against sparsity over time for fixed $\alpha = 0.1$ and for fixed $\beta = 0.1$, respectively. Panels (c)
and (d) show the clustering coefficient (CC) against $\beta$ and $\alpha$, respectively.


6.6 Network Visualization 

Figure 11 visualizes several snapshots of the largest connected component (LCC) of two 300-node networks for
two particular realizations of our model, under two different values of $\beta$. In both cases, we used
$\mu$ = 2 x 10“^, $\alpha = 1$, and $\eta = 1.5$. The top two rows correspond to $\beta = 0$ and represent one end
of the spectrum, i.e., an Erdos-Renyi random network. Here, the network evolves uniformly. The bottom two
rows correspond to $\beta = 0.8$ and represent the other end, i.e., scale-free networks. Here, the network
evolves locally, and clusters emerge naturally as a consequence of the local growth. They are depicted using
a combination of force-directed and Fruchterman-Reingold layouts with Gephi. Moreover, the figure also shows
the retweet events (from others as source) for two nodes, A and B, on the bottom row. These two nodes arrive
almost at the same time and establish links to two other nodes. However, node A’s followees are more central;
therefore, A is exposed to more retweets. Thus, node A performs more retweets than B does. This again shows
how information diffusion is affected by network structure. Overall, this figure illustrates that by careful
choice of parameters we can generate networks with very different structures.

Figure 12 illustrates the spike trains (tweet, retweet, and link events) for the first 140 nodes of a network
simulated with a similar set of parameters as above, and Figure 13 shows three snapshots of the network
at different times. First, consider node 6 in the network. After she joins the network, a few nodes begin
to follow her. Then, when she starts to tweet, her tweets are retweeted many times by others (red spikes
in the figure), and these retweets subsequently boost the number of nodes that link to her (magenta spikes).
This illustrates the scenario in which information diffusion triggers changes in the network structure.
Second, consider nodes 46 and 68 and compare their associated events over time. After some time, node
46 becomes much more active than node 68. To understand why, note that soon after time 137, node 46
followed node 130, which is a very central node (i.e., following a lot of people), while node 68 did not. This
illustrates the scenario in which network evolution triggers changes in the dynamics of information
diffusion.


6.7 Cascade Patterns 


Our model can produce the most commonly occurring cascade structures as well as heavy-tailed cascade
size and depth distributions, as observed in historical Twitter data reported in [25]. Figure 14 summarizes
the results, which provide empirical evidence that the higher the $\alpha$ ($\beta$) value, the shallower and
wider the cascades.


^http://gephi.github.io/

Figure 11: Evolution of two networks: one with $\beta = 0$ (1st and 2nd rows) and another one with $\beta = 0.8$
(3rd and 4th rows), and spike trains of nodes A and B (5th row).

Figure 12: Coevolutionary dynamics of events for the network shown in Figure 13.
Information Diffusion → Network Evolution: When node 6 joins the network, a few nodes follow her and
retweet her posts. Her tweets are propagated (shown in red), turning her into a valuable source of information.
Therefore, those retweets are followed by links created to her (shown in magenta).

Network Evolution → Information Diffusion: Nodes 46 and 68 both have almost the same number of
followees. However, as soon as node 46 connects to node 130 (which is a central node and retweets very
often), her activity dramatically increases compared to node 68.


Figure 13: Network structure in which events from Figure 12 take place, at different times (panels at
t = 125, t = 137, and t = 150).


7 Experiments on Model Estimation and Prediction on Synthetic Data

In this section, we first show that our model estimation method can accurately recover the true model
parameters from historical link and diffusion event data and then demonstrate that our model can accurately
predict the network evolution and information diffusion over time, significantly outperforming two
state-of-the-art methods [3, 4] at predicting new links, and a baseline Hawkes process that does not consider
network evolution at predicting new events.


7.1 Experimental Setup 

Throughout this section, we experiment with our model considering $m = 400$ nodes. We set the model
parameters for each node in the network by drawing samples from $\mu \sim U(0, 0.0004)$, $\alpha \sim U(0, 0.1)$,
$\eta \sim U(0, 1.5)$ and $\beta \sim U(0, 0.1)$. We then sample up to 60,000 link and information diffusion
events from our model using our simulation algorithm and average over 8 different simulation runs.

7.2 Model Estimation 

We evaluate the accuracy of our model estimation procedure via three measures: (i) the relative mean absolute
error (i.e., $\mathbb{E}[|x - \hat{x}|/x]$, MAE) between the estimated parameters ($\hat{x}$) and the true
parameters ($x$), (ii) Kendall’s rank correlation coefficient between each estimated parameter and its true
value, and (iii) the test log-likelihood. Figure 15 shows that as we feed more events into the estimation
procedure, the estimation becomes more accurate.
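The first two accuracy measures can be computed as in the sketch below. The paper does not state which variant of Kendall's coefficient is used, so the tau-a form here (ties counted as neither concordant nor discordant) is an assumption, and the parameter values are made up for illustration.

```python
def relative_mae(true_values, estimates):
    """Mean of |x - x_hat| / x over all parameters (true values must be nonzero)."""
    return sum(abs(x - xh) / x for x, xh in zip(true_values, estimates)) / len(true_values)

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

true_params = [1.0, 2.0, 4.0]       # hypothetical true values
estimated = [1.1, 1.8, 4.4]         # hypothetical estimates
mae = relative_mae(true_params, estimated)
tau = kendall_tau(true_params, estimated)
```

The rank correlation complements the MAE: it stays high when the estimates are biased but still order the nodes correctly.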


7.3 Link Prediction 


We use our model to predict the identity of the source for each test link event, given the historical events
before the time of the prediction, and compare its performance with two state-of-the-art methods, which we
denote as TRF [3] and WENG [4]. TRF measures the probability of creating a link from a source at a given
time by simply computing the proportion of new links created from the source over all links created
up to the given time. WENG considers several link creation strategies and makes a prediction by combining
these strategies.

Here, we evaluate the performance by computing the probability of all potential links using our model,
TRF and WENG and then compute (i) the average rank of all true (test) events (AvgRank) and (ii) the
success probability that the true (test) events rank among the top-1 potential events at each test time
(Top-1). Figure 16 summarizes the results, where we trained our model with an increasing number of events.
Our model outperforms both TRF and WENG by a significant margin.
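The two ranking measures can be sketched as follows. The score dictionaries, candidate names, and the optimistic tie-breaking rule (only strictly higher scores push the true candidate down) are assumptions made for this example, not details given in the paper.

```python
def rank_of_true(scores, true_key):
    """1-based rank of the true candidate; a higher score means a better rank."""
    true_score = scores[true_key]
    return 1 + sum(1 for k, s in scores.items() if k != true_key and s > true_score)

def evaluate(test_cases):
    """Return (AvgRank, Top-1 success probability) over all test events."""
    ranks = [rank_of_true(scores, true_key) for scores, true_key in test_cases]
    avg_rank = sum(ranks) / len(ranks)
    top1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return avg_rank, top1

# Each test event: predicted probability per candidate source, and the true source.
test_cases = [
    ({"a": 0.7, "b": 0.2, "c": 0.1}, "a"),
    ({"a": 0.3, "b": 0.5, "c": 0.2}, "a"),
    ({"a": 0.1, "b": 0.3, "c": 0.6}, "a"),
]
avg_rank, top1 = evaluate(test_cases)
```

Lower AvgRank and higher Top-1 are better; the two can disagree when a method is often close but rarely exactly right.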
Figure 14: Distribution of cascade structure, size and depth for different $\alpha$ ($\beta$) values and fixed $\beta = 0.2$

Figure 15: Performance of model estimation for a 400-node synthetic network.


7.4 Activity Prediction 

We use our model to predict the identity of the node that generates each test diffusion event, given the
historical events before the time of the prediction, and compare its performance with a baseline consisting
of a Hawkes process without network evolution. For the Hawkes baseline, we take a snapshot of the network
right before the prediction time, and use all historical retweeting events to fit the model. Here, we evaluate
the performance via the same two measures as in the link prediction task and summarize the results in
Figure 16 against an increasing number of training events. The results show that, by modeling the network
evolution, our model performs significantly better than the baseline.

8 Experiments on Coevolution and Prediction on Real Data 

In this section, we validate our model using a large Twitter dataset containing nearly 550,000 tweet, retweet
and link events from more than 280,000 users [3]. We will show that our model can capture the
co-evolutionary dynamics and, by doing so, it predicts retweet and link creation events more accurately than
several alternatives.

Figure 16: Prediction performance for a 400-node synthetic network by means of average rank (AR) and
success probability that the true (test) events rank among the top-1 events (Top-1). Panels: (a) Links: AR,
(b) Links: Top-1, (c) Activity: AR, (d) Activity: Top-1.


8.1 Dataset Description & Experimental Setup 

We use a dataset that contains both link events as well as tweets/retweets from millions of Twitter users [3]. 
In particular, the dataset contains data from three sets of users in 20 days; nearly 8 million tweet, retweet, 
and link events by more than 6.5 million users. The first set of users (8,779 users) are source nodes s, for 
whom all their tweet times were collected. The second set of users (77,200 users) are the followers of the 
first set of users, for whom all their retweet times (and source identities) were collected. The third set of 
users (6,546,650 users) are the users that start following at least one user in the first set during the recording 
period, for whom all the link times were collected. 

In our experiments, we focus on all events (and users) during a 10-day period (Sep 21, 2012 to Sep 30,
2012) and use the information before Sep 21 to construct the initial social network (original links between
users). We model the co-evolution in this second 10-day period using our framework. More specifically, in
the coevolution modeling, we have 5,567 users in the first layer who post 221,201 tweets. In the second layer,
101,465 retweets are generated by all 77,200 users in that interval. In the third layer, we have
198,518 users who create 219,134 links to 1,978 users (out of 5,567) in the first layer.

We split events into a training set (covering 85% of the retweet and link events) and a test set (covering
the remaining 15%) according to time, i.e., all events in the training set occur earlier than those in the test
set. We then use our model estimation procedure to fit the parameters from an increasing proportion of
events from the training data.
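A time-ordered split of this kind can be sketched as follows. The event tuples and the integer-percentage argument are illustrative assumptions; the only property that matters is that every training event precedes every test event.

```python
def chronological_split(events, train_percent=85):
    """Sort events by time and cut so that the earliest train_percent% form the training set."""
    ordered = sorted(events, key=lambda e: e[0])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]

# Hypothetical events as (time, kind) pairs, given in arbitrary order.
times = (7, 3, 15, 1, 9, 12, 5, 18, 2, 20, 11, 4, 16, 8, 13, 6, 19, 10, 14, 17)
events = [(float(t), "retweet" if t % 2 else "link") for t in times]
train, test = chronological_split(events)
```

Training on an increasing proportion of the data then amounts to taking growing prefixes of `train`, which keeps the test set fixed across runs.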


8.2 Retweet and Link Coevolution 


Figure 17 visualizes the retweet and link events, aggregated across different targets, and the corresponding
intensities given by our trained model for four source nodes, picked at random. Here, it is already apparent
that retweets (of a user's posts) and link creations (to that user) are clustered in time and often follow each
other, and our fitted model intensities successfully track such behavior. Further, Figure 18 compares the
cross-covariance between the empirical retweet and link creation intensities and between the retweet and link
creation intensities given by our trained model, computed across multiple realizations, for the same nodes.
For all nodes, the similarity between both cross-covariances is striking and both have their peak around 0,
i.e., retweets and link creations are highly correlated and co-evolve over time. For ease of exposition, as in
Section 6.2, we illustrated co-evolution using four nodes; however, we found consistent results across nodes.

To further verify that our model can capture the coevolution, we compute the average value of the
empirical cross covariance function, denoted by $m_{cc}$, per user. Intuitively, one could expect that our model
estimation method should assign higher $\alpha$ and/or $\beta$ values to users with high $m_{cc}$. Figure 19
confirms this intuition on 1,000 users, picked at random. Whenever a user has a high $\alpha$ and/or $\beta$
value, she exhibits a high cross covariance between her created links and retweets.


Figure 17: Link and retweet behavior of 4 typical users in the real-world dataset. Panels (a,c,e,g) show the
spike trains of link and retweet events and panels (b,d,f,h) show the estimated link and retweet intensities,
both against event occurrence time.

Figure 18: Empirical and simulated cross covariance of link and retweet intensities for 4 typical users.


8.3 Link prediction 

We use our model to predict the identity of the source for each test link event, given the historical (link and
retweet) events before the time of the prediction, and compare its performance with the same two
state-of-the-art methods as in the synthetic experiments, TRF [3] and WENG [4].

We evaluate the performance by computing the probability of all potential links using different methods,
and then compute (i) the average rank of all true (test) events (AvgRank) and (ii) the success probability
(SP) that the true (test) events rank among the top-1 potential events at each test time (Top-1). We
summarize the results in Figure 20(a-b), where we consider an increasing number of training retweet/tweet
events. Our model outperforms TRF and WENG consistently. For example, for 8 • 10^ training events, our
model achieves an SP 2.5 times larger than TRF and WENG.

8.4 Activity prediction 

We use our model to predict the identity of the node that generates each test diffusion event, given the
historical events before the time of the prediction, and compare its performance with a baseline consisting
of a Hawkes process without network evolution. For the Hawkes baseline, we take a snapshot of the network
right before the prediction time, and use all historical retweeting events to fit the model. Here, we evaluate
the performance via the same two measures as in the link prediction task and summarize the results
in Figure 20(c-d) against an increasing number of training events. The results show that, by modeling the
co-evolutionary dynamics, our model performs significantly better than the baseline.

Figure 19: Empirical cross covariance and learned model parameters for 1,000 users, picked at random

Figure 20: Prediction performance in the Twitter dataset by means of average rank (AR) and success
probability that the true (test) events rank among the top-1 events (Top-1).


8.5 Model Checking 

Given all the subsequent event times generated using a Hawkes process, i.e., $t_i$ and $t_{i+1}$, and according
to the time-changing theorem [38], the intensity integrals $\int_{t_i}^{t_{i+1}} \lambda(t)\, dt$ should conform to
the unit-rate exponential distribution. Figure 21 presents the quantiles of the intensity integrals, computed
using intensities with the parameters estimated from the real Twitter data, against the quantiles of the
unit-rate exponential distribution. It shows that the points approximately lie on the same line, giving
empirical evidence that a Hawkes process is an appropriate model to capture the real dynamics.
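The intensity integrals used in this check have a closed form for exponential kernels. The sketch below assumes a univariate process with intensity mu + alpha * sum_j exp(-omega (t - t_j)) and an observation start at time 0; the function names and the plotting-position rule for the quantiles are illustrative choices, not details from the paper.

```python
import math

def compensator_increments(events, mu, alpha, omega):
    """Integral of the intensity between consecutive events (first one starts at time 0)."""
    increments, excitation, last = [], 0.0, 0.0
    for t in events:
        dt = t - last
        # Base rate plus the decaying contribution of all earlier events.
        increments.append(mu * dt + alpha * excitation * (1.0 - math.exp(-omega * dt)) / omega)
        excitation = excitation * math.exp(-omega * dt) + 1.0
        last = t
    return increments

def qq_points(increments):
    """Pair unit-rate exponential quantiles with the sorted increments."""
    n = len(increments)
    theory = [-math.log(1.0 - (k + 0.5) / n) for k in range(n)]
    return list(zip(theory, sorted(increments)))

poisson_case = compensator_increments([1.0, 2.0, 4.0], mu=0.5, alpha=0.0, omega=1.0)
hawkes_case = compensator_increments([1.0, 1.5], mu=0.2, alpha=0.6, omega=2.0)
points = qq_points(poisson_case)
```

If the fitted model is adequate, the points returned by `qq_points` should fall close to the diagonal, which is the visual test described in the text.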


9 Related Work 

In this section, we survey related work on modeling temporal networks, followed by a subsection on
co-evolution dynamics. Next, we review the literature on information diffusion models. Finally, we conclude
this section with works that are closely related and were developed for almost the same goal.

Temporal Networks. Much effort has been devoted to modeling the evolution of social networks
[39, 40, 41, 42, 43]. Of the proposed methods for characterizing link creation, triadic closure [34] is a simple
but powerful principle to model the evolution based on shared friends. Modeling timing and rich features of
social interactions has been attracting increasing interest in the social network modeling community [44].
However, most of these models use timing information as discrete indices. The dynamics of the resulting
time-discretized model can be quite sensitive to the chosen discretization time steps: too coarse a
discretization will miss important dynamic features of the process, and too fine a discretization will increase
the computational and inference costs of the algorithms. In contrast, the events we try to model tend to be
asynchronous with a number of different time scales. [45] used rule-based methods to model the evolution of
the graph over time. [46] analyzed community structure over time and [47] studied the interaction of the
friendship graph among group members and group growth. Recently, [48] used a Cox-intensity Poisson model
with exponential random graphs to model friendship dynamics. [49] extended this model to the temporal
sequence of interactions that take place in the social network, but with insufficient model flexibility and
limited scalability. Modeling temporal dynamics of interactions in this way provides new opportunities for
identifying network topology at multiple scales [50] and for early detection of popular resources [51, 52].
However, these works largely fail to model the interdependency between events generated by different users,
which is one of the focuses of our proposed framework. Most of this line of work is summarized in a recent
survey [53], with a short section devoted to point process based approaches.

Figure 21: Quantile plots of the intensity integrals from the real link and retweet event times. (a) Link
process. (b) Retweet process.

Co-evolution Dynamics. In machine learning and several other communities, both the dynamics on
the network and the dynamics of the network have been extensively studied, and combining the two is a
natural next step. For example, [54] claimed that content generation in social networks is influenced not
just by users' personal features like age and gender, but also by their social network structure. Furthermore,
research has been done to address co-evolution problems, for example, in the complex network literature,
under the name of adaptive systems [55, 56, 57]. The main premise is that the evolution of the topology
depends on the dynamics of the nodes in the network, and a feedback loop can be created between the two,
which allows dynamical exchange of information. It has been shown that adaptive networks are capable
of self-organizing towards dynamically critical states, like phase transitions, by the interplay between the
two processes on different time scales [58]. In a different context, epidemiologists have found that nodes
may rewire their links to try to avoid contact with the infected ones [59, 60]. Co-evolutionary models have
also been developed for collective opinion formation, investigating whether the coevolutionary dynamics will
eventually lead to consensus or fragmentation of the population [61]. However, this line of research tends to be
less data-driven. Moreover, although the general nonlinear dynamic-system based methods usually address
co-evolutionary phenomena that are macroscopic in nature, they lack the inference power of statistical
generative models, which are more adapted to teasing out microscopic details from the data. Finally, we
would also like to mention a different line of research exemplified by the actor-oriented models developed
by [62], where a continuous-time Markov chain on the space of directed networks is specified by local
node-centric probabilistic link change rules, and MCMC and the method of moments are used for parameter
estimation. In contrast, the Hawkes processes we use are generally non-Markovian and make use of event
history far into the past.

Information Diffusion. The presence of timing information in event data and the ability to model such
information bring up the interesting question of how to use the learned model for time-sensitive inference
or decision making. Furthermore, the development of online social networks has attracted a lot of empirical
studies of the online influence patterns of online communities [63] [64l ESI ES], microblogs (STIEHI and so
on. However, these works usually consider only relatively simple models for the influence, which may not
be very predictive. For more mathematically oriented works, based on information cascades (a special case
of asynchronous event data) from social networks, discrete-time diffusion models have been fitted to the
cascades IMIIZOI and used for decision making, such as identifying influencers [63], maximizing information
spread EZIITII, and marketing planning |711|7S1|71|75|. Several recent experimental comparisons on both
synthetic and real-world data showed that continuous-time models yield significant improvement in settings
such as recovering hidden diffusion network topologies from cascade data [ZHUaEZ], predicting the timings
of future events [TsuTg, and finding the source of information cascades [9]. Besides this, point process
modeling of activity in networks is becoming increasingly popular [SnilHIllH2- These time-sensitive modeling
and decision making problems can usually be framed as optimization problems and are usually difficult to
solve. This brings up interesting optimization problems, such as efficient submodular function optimization
with provable guarantees IHSIEI], sampling methods [HU IHl] for inference and prediction, and the convex
framework proposed in m to make decisions to shape the activity to a variety of objectives. Furthermore, the
high-dimensional nature of modern event data makes the evaluation of the objective function of the
optimization problem even more expensive. Therefore, more accurate models and more sophisticated
algorithms need to be designed to tackle the challenges posed by modern event data applications.

The works most closely related to ours are the empirical studies of information diffusion and network
evolution [551111 HUE]. Among them, [4] was the first to show experimental evidence that information diffusion
influences network evolution in microblogging sites at both the system-wide and individual levels. In particular,
they studied Yahoo! Meme, a social microblogging site similar to Twitter, which was active between 2009
and 2012, and showed that the likelihood that a user u starts following a user s increases with the number
of messages from s seen by u. [3] investigated the temporal and statistical characteristics of retweet-driven
connections within the Twitter network and identified the number of retweets as a key factor for inferring
such connections. [5] showed that the Twitter network can be characterized by steady rates of change,
interrupted by sudden bursts of new connections triggered by retweet cascades. They also developed a method
to predict which retweets are more likely to trigger these bursts. Finally, [87] utilized a multivariate Hawkes
process to establish a connection between the temporal properties of activities and the structure of the network.
In contrast to our work, they studied static properties, e.g., community structure, and inferred the latent
clusters from the observed activities.

However, there are fundamental differences between the above-mentioned studies and our work. First,
they only characterize the effect that information diffusion has on the network dynamics, but not the
bidirectional influence. In contrast, our probabilistic generative model takes into account the bidirectional
influence between information diffusion and network dynamics. Second, previous studies are mostly empirical
and only make binary predictions on link creation events. For example, the work of HE] predicts whether a new
link will be created based on the number of retweets, and [5] predicts whether a burst of new links will occur
based on the number of retweets and users' similarity. In contrast, our model is able to learn parameters from
real-world data and predict the precise timing of both diffusion and new link events.

10 Extensions 

The basic model presented in Section [^ is just a showcase of the potential of point processes in modeling
networks and processes over them. In this section, we extend our model in a variety of ways. More specifically,
we explain how the model can be augmented to support link removal, node birth and death, and
connection-specific parameters. We did not perform experiments with these extensions because our real-world
dataset does not contain information regarding link removal or node birth and death. Curating a comprehensive
dataset that can be used to model all these aspects of networks is left as interesting future work.

10.1 Link deletion 

We can generalize our model to support link deletion by introducing an intensity matrix
$\Xi^*(t) = (\xi^*_{us}(t))_{u,s \in [m]}$ and modeling each individual intensity as a survival process. Assume
$A^+(t)$ is the previously defined counting matrix A(t), which indicates the existence of an edge at time t.
Then, we introduce a new counting matrix $A^-(t) = (A^-_{us}(t))_{u,s \in [m]}$, which indicates the lack of an
edge at time t, and we define it via its intensity function as

$$\mathbb{E}[dA^-(t) \mid \mathcal{H}^r(t) \cup \mathcal{H}^l(t)] = \Xi^*(t)\, dt. \qquad (40)$$

Then, we define the intensity as

$$\xi^*_{us}(t) = A^+_{us}(t) \Big( \zeta_u + \nu_s \sum_{v \in \mathcal{F}_u} \kappa_{\omega_3}(t) \star dA^-_{vs}(t) \Big), \qquad (41)$$



where the term $A^+_{us}(t)$ guarantees that the link has positive intensity to be removed only if it already
exists, just like the term $1 - A_{us}(t)$ in Equation (21), the parameter $\zeta_u$ is the base rate of link
deletion, and $\nu_s \sum_{v \in \mathcal{F}_u} \kappa_{\omega_3}(t) \star dA^-_{vs}(t)$ is the increase in link
deletion intensity due to the number of followees of u who decided to unfollow s. This is an excitation term
due to deleted links to source s: if s is unfollowed by some followees of u, then u may also find that s is
not a good source of information.

Given a pair of nodes (u, s), the process starts with $A^+_{us}(t) = 0$. Whenever a link is created, this
process ends and a removal process $A^-_{us}(t)$ starts. Similarly, when the removal process fires, the
connection is removed and a new link creation process is instantiated. These two processes interleave until
the end.
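
The gated form of the link deletion intensity can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes an exponential triggering kernel exp(-omega * dt), and the function name `deletion_intensity` and its arguments (`base_rate`, `excitation`, `omega`) are hypothetical stand-ins for the base rate and excitation coefficient of Equation (41).

```python
import math

def deletion_intensity(t, link_exists, base_rate, excitation,
                       followee_unfollow_times, omega=1.0):
    """Rate at which u unfollows s at time t (illustrative sketch).

    followee_unfollow_times: times at which followees of u unfollowed s.
    The exponential kernel exp(-omega * dt) is an assumption of this sketch.
    """
    # A link can only be removed if it currently exists.
    if not link_exists:
        return 0.0
    # Each earlier unfollow by a followee adds a decaying amount of excitation.
    triggered = sum(math.exp(-omega * (t - ti))
                    for ti in followee_unfollow_times if ti < t)
    return base_rate + excitation * triggered
```

The early return plays the role of the existence indicator: while the link is absent only the creation process can fire, and once it exists only the removal process can, which is what makes the two processes interleave.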


10.2 Node birth and death 


We can augment our model to allow the number of nodes m(t) to change over time:

$$m(t) = m_b(t) - m_d(t), \qquad (42)$$

where $m_b(t)$ and $m_d(t)$ are counting processes modeling the numbers of nodes that have joined and left the
network up to time t, respectively. The way we construct $m_b(t)$ and $m_d(t)$ guarantees that m(t) is always
non-negative. The birth process, $m_b(t)$, is characterized by a conditional intensity function $\phi^*(t)$:

$$\mathbb{E}[dm_b(t) \mid \mathcal{H}^r(t) \cup \mathcal{H}^l(t)] = \phi^*(t)\, dt, \qquad (43)$$

where

$$\phi^*(t) = \epsilon + \theta \sum_{u,s \in [m(t)]} \kappa_{\omega_4}(t) \star dN_{us}(t). \qquad (44)$$

Here, $\epsilon$ is the constant rate of arrival and
$\theta \sum_{u,s \in [m(t)]} \kappa_{\omega_4}(t) \star dN_{us}(t)$ is the increased rate of node arrival
due to the increased activity of nodes. Intuitively, the higher the overall activity in the existing network, the
larger the number of new users.

The construction of the death process, $m_d(t)$, is more involved. Every time a new user joins the network,
we start a survival process that controls whether she leaves the network. Thus, we can stack all these survival
processes in a vector, $l(t) = (l_u(t))_{u \in [m]}$, characterized by a multidimensional conditional intensity
function

$$\mathbb{E}[dl(t) \mid \mathcal{H}^r(t) \cup \mathcal{H}^l(t)] = \sigma^*(t)\, dt. \qquad (45)$$

Intuitively, we expect the nodes with lower activity to be more likely to leave the network, and thus the
conditional intensity function to adopt the following form:

$$\sigma^*_u(t) = (1 - l_u(t)) \Big( h(t) - \sum_{s \in [m(t)]} \kappa_{\omega_5}(t) \star dN_{us}(t) \Big)_+, \qquad (46)$$

where the term $(1 - l_u(t))$ ensures that a node is deleted only once, $h(t)$ is the history-independent
typical rate of death, shared across nodes, which we represent by a grid of known temporal kernels, $\{g_j(t)\}$,
with unknown coefficients, and the second term captures the effect of activity on the probability
of leaving the network. More specifically, if a node is not active, we assume its intensity is upper bounded
by h(t), and the more active she becomes, the larger the term
$\sum_{s \in [m(t)]} \kappa_{\omega_5}(t) \star dN_{us}(t)$ and the lower her probability of leaving the network.
The hinge function $(\cdot)_+$ guarantees that the intensity is always non-negative.

Then, given the individual death processes $l_u(t)$, the total death process is

$$m_d(t) = \sum_{u=1}^{m_b(t)} l_u(t), \qquad (47)$$

which completes the modeling of the time-varying number of nodes.
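
The birth and death intensities above can be illustrated with a short Python sketch. It is a hedged illustration rather than the authors' code: the exponential kernel exp(-omega * dt), the function names, and the argument names (`base_rate`, `theta`, `h`, `omega`) are all assumptions made for this example, and `h` is passed as an arbitrary callable instead of being expanded on a grid of kernels.

```python
import math

def kernel_sum(t, event_times, omega):
    """Sum of exp(-omega * (t - ti)) over events strictly before t (assumed kernel)."""
    return sum(math.exp(-omega * (t - ti)) for ti in event_times if ti < t)

def birth_intensity(t, all_event_times, base_rate, theta, omega=1.0):
    """Arrival rate of new nodes: constant rate plus excitation from overall activity."""
    return base_rate + theta * kernel_sum(t, all_event_times, omega)

def death_intensity(t, has_left, h, own_event_times, omega=1.0):
    """Leaving rate of one node: zero once it has left, otherwise a hinge."""
    if has_left:
        return 0.0
    # The hinge max(., 0) keeps the rate non-negative for very active nodes.
    return max(h(t) - kernel_sum(t, own_event_times, omega), 0.0)

def current_node_count(num_births, left_flags):
    """m(t) = m_b(t) - m_d(t), where m_d(t) counts the nodes that have left."""
    return num_births - sum(1 for flag in left_flags if flag)
```

Because only nodes that have joined can carry a leave flag, the count returned by `current_node_count` cannot drop below zero, mirroring the non-negativity of m(t).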
10.3 Incorporating features


One can simply enrich the model by taking into account the longitudinal or static information of the
networked data, e.g., by conditioning the intensity on additional external features, such as node attributes or
edge types. Let us assume each user u comes with a K-dimensional feature vector $x_u$ including properties
such as her age, job, location, number of followers, number of tweets, etc.

Then, we can augment the information diffusion intensity as follows. We introduce a K-dimensional
parameter vector $\eta_u$, in which each dimension reflects the contribution of the corresponding element
of the feature vector to the intensity, and replace the baseline rate $\eta_u$ by $\eta_u^\top x_u$. Similarly,
we introduce a K-dimensional vector $\beta_s$, where each dimension has a corresponding element in the feature
vector $x_s$, and substitute $\beta_s$ by $\beta_s^\top x_s$. Therefore, one can rewrite the original information
diffusion intensity given by Equation (19) as:

$$\gamma^*_{us}(t) = \mathbb{I}[u = s]\, \eta_u^\top x_u + \mathbb{I}[u \neq s]\, \beta_s^\top x_s \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star \big(A_{uv}(t)\, dN_{vs}(t)\big). \qquad (48)$$

Similarly, we can parameterize the coefficients of the link creation intensity by K-dimensional vectors
and write the counterpart of Equation (21) incorporating features of the node for computing the intensity:

$$\lambda^*_{us}(t) = (1 - A_{us}(t)) \Big( \mu_u^\top x_u + \alpha_u^\top x_u \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_2}(t) \star dN_{vs}(t) \Big). \qquad (49)$$


Perhaps surprisingly, all the results on the convexity of parameter learning and on efficient simulation
remain valid in this case. As long as the features contribute to the intensity linearly, the log-likelihood
is concave and we can simulate the model as efficiently as the original model.
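
The concavity claim can be checked numerically on a toy instance. The sketch below is an assumption-laden illustration, not part of the original experiments: it uses the generic point-process log-likelihood (sum of log-intensities at the events minus the integrated intensity), with an intensity that is linear in a parameter vector `theta`. All names and numbers are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_likelihood(theta, event_features, integrated_features):
    """Log-likelihood when the intensity is linear in theta.

    event_features: one vector phi_i per event, intensity there is theta . phi_i > 0.
    integrated_features: vector Phi such that the integrated intensity is theta . Phi.
    """
    return (sum(math.log(dot(theta, phi)) for phi in event_features)
            - dot(theta, integrated_features))

# Toy data (made up for illustration only).
events = [[1.0, 2.0], [0.5, 1.0]]
integral = [3.0, 4.0]
theta_a, theta_b = [1.0, 0.5], [0.2, 2.0]
midpoint = [(a + b) / 2 for a, b in zip(theta_a, theta_b)]

# Concavity implies the midpoint value is at least the average of the endpoints.
gap = log_likelihood(midpoint, events, integral) - 0.5 * (
    log_likelihood(theta_a, events, integral)
    + log_likelihood(theta_b, events, integral))
```

The log of a linear function is concave and the integrated term is linear, so `gap` is non-negative for any pair of feasible parameter vectors. Features enter only through the fixed vectors, which is why adding them does not change this structure.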


10.4 Connection specific parameters 


Up to this point, the parameters of the link creation and removal, node birth and death, and information
diffusion intensities depend on one endpoint of the interaction. For example, $\beta_s$ and $\eta_u$ in the
information diffusion intensity given by Equation (19) only depend on the source and the actor, respectively.
However, proceeding with this example, parameters can be made connection specific, i.e., Equation (19) can be
restated as

$$\gamma^*_{us}(t) = \mathbb{I}[u = s]\, \eta_{us} + \mathbb{I}[u \neq s]\, \beta_{us} \sum_{v \in \mathcal{F}_u(t)} \kappa_{\omega_1}(t) \star \big(A_{uv}(t)\, dN_{vs}(t)\big), \qquad (50)$$

where $\eta_{us}$ is the base intensity of u retweeting a tweet originated by s and $\beta_{us}$ is the coefficient
of excitement of u to retweet s when one of her followees retweets something from s.

Given enough computational resources and large amounts of historical data, one can take into account
more complex scenarios and larger and more flexible models. For example, the middle user, say v, who is
along the path of diffusion and forwards the tweet originated from s to u, can also be taken into consideration,
i.e., by defining $\beta_{svu}$ as the amount of increase in the intensity of user u retweeting from s when user v
has just retweeted a post from s. All desirable properties of the simulation algorithm and parameter estimation
method still hold.


11 Conclusion and Future Works 

In this work, we proposed a joint continuous-time model of information diffusion and network evolution, which
can capture the coevolutionary dynamics, mimic the most common static and temporal network patterns
observed in real-world networks and information diffusion data, and predict network evolution
and information diffusion more accurately than previous state-of-the-art methods. Using point processes to
model intertwined events in information and social networks opens up many interesting avenues for future work.
Our current model is just a showcase of the rich set of possibilities offered by a point process framework,
which have rarely been explored in large-scale social network modeling. Several directions remain as
future work and are interesting to explore. For example:
• A large and diverse range of point processes can also be used in the framework to augment the
current model without changing the efficiency of simulation or the convexity of parameter estimation.

• We can incorporate features from the previous state of the diffusion or network structure. For example,
one can model information overload by adding a nonlinear transfer function on top of the diffusion
intensity, or model peer pressure by adding a nonlinear transfer function depending on the number of
neighbors.

• There are situations in which the processes naturally evolve at different time scales. For example, link
dynamics are meaningful at the scale of days, whereas information propagation usually occurs at a
resolution of hours or even minutes. Developing an efficient mechanism to account for
heterogeneity in time resolution would improve the model’s ability to predict.

• We may augment the framework to allow time-varying parameters. The simulation would not be 
affected and the estimation of time-varying interaction can still be carried out via a convex optimization 
problem m- 

• Alternatively, one can use different triggering kernels for the Hawkes processes and learn them to 
capture finer details of temporal dynamics. 

Acknowledgement. The authors would like to thank Demetris Antoniades and Constantine Dovrolis
for providing them with the dataset. The research was supported in part by NSF/NIH BIGDATA
1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, NSF CAREER IIS-1350983.

A Ogata’s Algorithm 

In this section, we revisit Ogata’s algorithm in more detail. Consider a U-dimensional point process in
which each dimension u is characterized by a conditional intensity function $\lambda^*_u(t)$.

Ogata’s algorithm starts with summing the intensities, $\lambda^*_{sum}(\tau) = \sum_{u=1}^{U} \lambda^*_u(\tau)$.
Then, assuming we have simulated up to time t, the next sample time, t', is the first event drawn from the
non-homogeneous Poisson process with intensity $\lambda^*_{sum}(\tau)$ which begins at time t. Here, the
algorithm exploits the fact that, given a fixed history, the Hawkes process is a non-homogeneous Poisson
process, which runs until the next event happens. Then, the new event results in an update of the intensities
and a new non-homogeneous Poisson process starts.

It can be shown that the waiting time of a non-homogeneous Poisson process is an exponentially
distributed random variable with rate equal to the integral of the intensity m, i.e.,
$s \sim \text{Exponential}\big(\int_t^{t+s} \lambda^*_{sum}(\tau)\, d\tau\big)$; more precisely, the integrated
intensity over the waiting time follows a unit-rate exponential distribution. Thus, the next sample time can be
computed as

$$t' = \underbrace{t}_{\text{current time}} + \underbrace{s}_{\text{waiting time for the first event}}.$$

Sampling from a non-homogeneous Poisson process is not straightforward; therefore, Ogata’s algorithm uses
rejection sampling with a homogeneous Poisson process as the proposal distribution. In more detail, given
$\hat{\lambda} = \max_{t \le \tau \le T} \lambda^*_{sum}(\tau)$, t' is the time of the first event of a homogeneous
Poisson process with rate $\hat{\lambda}$. Then, we accept the sample time with probability
$\lambda^*_{sum}(t') / \hat{\lambda}$. Finally, the dimension firing the event is determined
by sampling proportionally to the contribution of the intensity of that user to the total intensity, i.e.,
$\lambda^*_u(t') / \lambda^*_{sum}(t')$ for $1 \le u \le U$. This procedure is iterated until we reach the end of
simulation time T. Algorithm 5 presents the complete procedure.
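
The thinning procedure described above can be sketched in Python as follows. This is a minimal illustration under explicit assumptions, not the authors' implementation: the intensities are taken to be multivariate Hawkes intensities with an exponential kernel exp(-omega * dt), non-negative excitation weights `alpha`, and a strictly positive total base rate `sum(mu)`. Under these assumptions the total intensity only decays between events, so its value at the current time serves as the upper bound instead of a general maximum over the remaining window. All function and parameter names are hypothetical.

```python
import math
import random

def intensities(t, events, mu, alpha, omega):
    """Per-dimension intensities at time t given past events (ti, ui).

    alpha[ui][u] is the assumed excitation of dimension u by an event in ui.
    Events with ti <= t are included, so right after an event the jump counts.
    """
    lam = list(mu)
    for ti, ui in events:
        if ti <= t:
            decay = math.exp(-omega * (t - ti))
            for u in range(len(mu)):
                lam[u] += alpha[ui][u] * decay
    return lam

def ogata_thinning(mu, alpha, omega, T, seed=0):
    """Simulate events (time, dimension) on [0, T] by rejection sampling."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        # Upper bound: the total intensity is non-increasing until the next event.
        upper = sum(intensities(t, events, mu, alpha, omega))
        # Candidate time from a homogeneous Poisson process with rate `upper`.
        t = t + rng.expovariate(upper)
        if t > T:
            break
        lam = intensities(t, events, mu, alpha, omega)
        # Rejection test: keep the candidate with probability sum(lam) / upper.
        if rng.random() * upper > sum(lam):
            continue
        # Attribution: pick the dimension proportionally to its intensity.
        u = rng.choices(range(len(mu)), weights=lam)[0]
        events.append((t, u))
    return events
```

A possible call is `ogata_thinning([0.5, 0.5], [[0.2, 0.1], [0.1, 0.2]], omega=1.0, T=10.0, seed=1)`. Each iteration re-evaluates every intensity over the full history, which is exactly the cost that makes the naive procedure expensive in high dimensions.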

Ogata’s algorithm scales poorly with the dimension of the process because, after each sample, we
need to re-evaluate the affected intensities and find the upper bound. As a consequence, a naive
implementation that draws n samples has $O(U n^2)$ time complexity, where U is the number of dimensions.
This is because for each sample we need to find the new summation of intensities, which involves O(U)
individual ones, each taking O(n) time to accumulate over the history. In our social networks application,
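Algorithm 5 Ogata’s Algorithm

 1: Input: U-dimensional Hawkes process $\{\lambda^*_u(t)\}_{u=1}^{U}$, Due time: T
 2: Output: Set of events: $\mathcal{H} = \{(t_1, u_1), \ldots, (t_n, u_n)\}$
 3: $t \leftarrow 0$
 4: $i \leftarrow 0$
 5: while $t < T$ do
 6:     $\lambda_{sum}(\tau) \leftarrow \sum_{u=1}^{U} \lambda^*_u(\tau)$
 7:     $\hat{\lambda} \leftarrow \max_{t \le \tau \le T} \lambda_{sum}(\tau)$
 8:     $s \sim \text{Exponential}(\hat{\lambda})$                    ▷ Sampling next event time
 9:     $t' \leftarrow t + s$
10:     if $t' > T$ then
11:         break
12:     end if
13:     $\bar{\lambda} \leftarrow \lambda_{sum}(t')$
14:     $d \sim \text{Uniform}(0, 1)$
15:     if $d \times \hat{\lambda} > \bar{\lambda}$ then             ▷ Rejection test
16:         $t \leftarrow t'$
17:         Goto 6
18:     end if
19:     $S \leftarrow 0$
20:     $d \sim \text{Uniform}(0, 1)$
21:     for $u \leftarrow 1$ to $U$ do
22:         $S \leftarrow S + \lambda^*_u(t') / \bar{\lambda}$
23:         if $S \ge d$ then                                        ▷ Attribution test
24:             $i \leftarrow i + 1$
25:             $u_i \leftarrow u$
26:             $t_i \leftarrow t'$
27:             $t \leftarrow t_i$
28:             Goto 6
29:         end if
30:     end for
31:     Given the new event just sampled, update the intensity functions $\lambda^*_u(\tau)$
32: end while
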



we have $m^2 - m$ point processes for link creation and $m^2$ ones for retweeting, i.e., $U = O(m^2)$. Therefore,
Ogata’s algorithm takes $O(m^2 n^2)$ time complexity.
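
To make this scaling concrete, the following back-of-the-envelope sketch counts the point processes for a network of m users and the resulting naive cost for n samples. The function names are hypothetical and constants are ignored, so the numbers are order-of-magnitude operation counts rather than measured runtimes.

```python
def num_processes(m):
    """Number of point processes for m users."""
    link_creation = m * m - m   # one per ordered pair of distinct users
    retweeting = m * m          # one per (user, source) pair, including u = s
    return link_creation + retweeting

def naive_ogata_cost(m, n):
    """Order-of-magnitude operation count U * n^2 of the naive algorithm."""
    return num_processes(m) * n * n
```

For example, `num_processes(10)` is 190, and `naive_ogata_cost(10, 100)` is 1,900,000, which shows how quickly the cost grows with both the number of users and the number of samples.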